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[Text] 

UDC  621.372.542 

V.A.  Ivanov,  "Algorithms  Generating  a  Series  of  Nonstationary  Images  Based  on 
a  Mosaic  Model,"  AVTOMETRIYA,  No  6,  1988. 

The  objective  is  to  simulate  a  series  of  nonstationary  images  needed  for 
simulation  of  a  broad  battery  of  image  types  and  to  validate  and  test 
algorithms  processing  image  series.  A  mosaic  model  is  validated  as  a  means 
for  generation  of  the  microstructures  of  images  simulated.  Algorithms  are 
developed  which  synthesize  images  on  a  mosaic  with  random  boundaries,  move 
mosaic  unit  cells  in  coordinates  and  vary  brightness  to  form  a  series  of 
nonstationary  images.  A  software  package  based  on  this  algorithm  has  been 
created  which  is  run  on  an  E-79  computer. 

UDC  519.24 

V.G.  Alekseyev,  "Choice  of  Evaluation  Parameters  for  a  Regression  Curve  by  a 
Crossverification  Methods,"  AVTOMETRIYA,  No  6,  1988. 

A  crossverification  method  is  described  which  is  used  for  nonparametric 
evaluation  of  a  regression  curve  that  depends  on  two  functional  and  two 
numeric  parameters. 

UDC  621.391 

Yu. A.  Kapitonov,  Yu. I.  Palagin  and  A.S.  Shalygin,  "Reconstructing  the 
Distribution  Density  of  Random  Fields  and  Processes  From  Experimental  Data  by 
Using  Kernel  Estimates,"  AVTOMETRIYA,  No  6,  1988. 

Rosenblatt-Parsen  kernel  estimates  are  employed  to  reconstruct  distribution 
densities  of  random  fields  and  processes.  The  suitability  of  an  asymptotic 
theory  of  independent  samples  is  investigated.  The  effect  of  correlation 
characteristics  of  fields  and  processing  parameters  on  estimate  errors  is 
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analyzed.  A  comparison  with  the  method  of  histograms  is  made  and 
illustrations  are  given. 

UDC  621.391.2 

A.S.  Danyayev  and  V.N.  Prokofyev,  "Invariant  Algorithms  of  Automatic 
Detection  of  'Signal  Trajectories,"'  AVTOMETRIYA,  No  6,  1988. 

Detection  of  a  signal  in  a  specified  frequency-time  (or  space-time)  domain  is 
studied  under  the  assumption  that  the  unknown  "location"  of  the  signal  in  the 
domain  need  not  be  constant  during  the  time  of  analysis  (it  forms  a 
trajectory)  and  that  the  signal  and  noise  levels  not  known  a  priori  can  also 
vary  in  the  analysis  domain.  Invariant  algorithms  are  proposed  which  detect 
and  identify  the  location  of  the  "signal  trajectory,"  which  can  be 
implemented  in  automatic  devices;  the  efficiency  of  the  algorithms  is  noted. 

UDC  621.397.2:519.685 

I.M.  Bokshteyn,  "Component  Transformation  Method  Used  to  Reduce  Redundancy  in 
Color  Images,"  AVTOMETRIYA,  No  6,  1988. 

Possible  applications  of  the  component  transformation  method  with 
interpolation  in  readings  are  evaluated  as  a  way  to  reduce  redundancy  of 
multigradational  color  images.  A  version  is  suggested  which  ensures  a  high 
degree  of  compression  (up  to  2.62  bits/reading)  with  a  high  reconstruction 
quality.  An  optimal  distribution  of  the  number  of  digitization  levels  with 
respect  to  the  components  is  found.  The  results  of  a  simulation  are 
described  which  tested  the  capabilities  of  the  method  on  processing  of  real 
color  images.  The  test  confirmed  the  efficiency  of  the  component 
transformation  method. 

UDC  681.327.68:778.38 

V.A.  Dombrovskiy,  S.A.  Dombrovskiy  and  Ye.F.  Pen,  "Reliability  of  Information 
Reading  in  a  Holographic  Memory  Channel  With  Constant  Parameters," 
AVTOMETRIYA,  No  6,  1988. 

Calculated  data  reading  reliability  in  holographic  memory  based  on  an 
assumption  of  normal  distribution  and  a  generalized  Rayleigh-Rice 
distribution  of  image  powers  "1"  and  "0"  is  compared  with  the  reliability 
measured  experimentally.  The  reading  reliability  is  investigated  as  a 
function  of  the  photometric  characteristics  and  photoscanning  regime. 

UDC  517.44:621.391.1:681.3.012 

O.Ye.  Baklanova,  M.V.  Zyuzin  and  A.V.  Lyulyakov,  "Implementation  of  Parallel 
Digital-Filtering  Algorithms  on  the  PS-2000  Multiprocessor  Computation 
Complex,"  AVTOMETRIYA,  No  6,  1988. 

Algorithms  for  the  digital  filtering  of  signals  and  images  implemented  on  a 
PS-2000  multiprocessor  computation  complex  are  described-  Calculations 
demonstrating  the  efficiency  of  the  processing  of  experimental  data  are 
reported. 
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UDC  681.3.06 

A.M.  Kovalev  and  Yu.V.  Tarasov,  "Texture  on  Arbitrarily  Oriented  Flat 
Surfaces,"  AVTOMETRIYA,  No  6,  1988. 

Mapping  a  texture  file  onto  an  arbitrarily  oriented  plane  in  a 
three-dimensional  space  is  studied.  A  method  of  mapping  is  proposed  that  is 
suitable  for  hardware  implementation  and  can  be  used  in  multiprocessor 
real-time  computer  graphics  systems. 

UDC  535.853.23:578.08 

L.A.  Avramov,  V.N.  Verkhoturov,  V.V.  Gorokhov,  A. I.  Komarov,  B.P. 

Korvatovskiy,  A. A.  Lazarev,  V.Z.  Pashchenko,  A.Ya.  Pikulenko  and  A.B.  Rubin, 
"Computerized  Absorption  Spectrometer  for  Kinetic  and  Spectral  Studies  in  the 
Picosecond  Time  Range,"  AVTOMETRIYA,  No  6,  1988. 

A  computerized  absorption  spectrometer  is  described  designed  for  kinetic  and 
spectral  studies  in  the  picosecond  time  range.  The  software  for  the 
experimental  installation  is  described.  The  capabilities  of  the  unit  have 
been  demonstrated  in  a  test  measuring  the  variation  kinetics  of  optical 
absorption  with  a  wavelength  652  nm. 

UDC  681.3:519.246.8 

D.N.  Zabiyaka,  A. A.  Predtechenskiy  and  A. I.  Chernykh,  "Spectrum  Analyzer 
Based  on  an  Elektronika  60  Microcomputer,"  AVTOMETRIYA,  No  6,  1988. 

A  narrow-band  digital  spectrum  analyzer  is  described  for  signals  in  the 
sonic— frequency  range  (up  to  16  kHz)  and  the  infrasonic  range.  The  analyzer 
is  based  on  the  DVK-2  and  Elektronika  60M  microcomputers  without  any 
structural  modifications.  A  software  implementation  of  the  periodogram 
method  and  of  related  spectrum-analysis  algorithm  is  capable  of  real-time 
signal  processing  in  a  band  of  up  to  250  Hz. 

UDC  621.39.1:621.378:532.57 

V.A.  Zhmud,  "Servoprocessor  for  LDIS  [not  further  identified]  Signals  in 
Backscattering  Mode,"  AVTOMETRIYA,  No  6,  1988. 

A  new  system  is  described  which  processes  LDIS  signals  of  a  hydraulic  flow  in 
backscattering  mode.  The  core  element  of  the  system  is  a  servoprocessor 
featuring  high  accuracy,  operation  speed  and  noise  immunity  due  to  a  new 
memory  circuit,  a  combination  of  monitor  and  frequency  signal  quality 
criteria  and  introduction  of  an  additional  acquisition  criterion. 

UDC  681.335.2 

L.K.  Samoylov  and  G.I.  Tkachenko,  "Data  Collection  Program  Generation  in 
Information  and  Measurement  Systems,"  AVTOMETRIYA,  No  6,  1988- 
Various  algorithms  forming  data  collection  programs  and  their  shortcomings 
analyzed.  New  algorithms  are  described  which  ensure  a  higher  efficiency  in 
utilizing  the  communication  channel  throughput  and  allow  placing  the 
appropriate  number  of  synchronization  pulses  in  a  cycle  in  neighboring 
intervals.  A  block  diagram  of  the  computer  algorithm  is  given  which  forms 
the  effective  data  collection  program  for  an  arbitrary  set  of  desired  readout 
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frequencies  under  the  assumption  that  synchronization  pulses  are  placed  in 
neighboring  intervals. 

UDC  535.4:778.38 

I.G.  Palchikova,  "Kinoform  Conoid  Axicons,"  AVTOMETRIYA,  No  6,  1988. 

A  new  method  is  proposed  for  calculating  the  characteristics  of  kinoform 
axicons  which  can  produce  a  desired  intensity  distribution  along  the  focal 
segment.  Experimental  studies  of  axicon  caustics  are  reported. 

UDC  681.327.68:778.38 

V.K.  Yerokhovets,  "Estimating  the  Tolerances  of  Linear  Positioning  of 
Components  in  Documentary  Graphic  Memories,"  AVTOMETRIYA,  No  6,  1988. 

In  general  mathematical  terms,  the  tolerances  of  linear  positioning  of  the 
components  of  a  graphic  memory  are  estimated  (a  transparency  with  initial 
data,  point  light  sources  in  reference  and  reading  channels  of  the  memory)  in 
terms  of  sharpness  and  accuracy  of  the  format  reproduced. 

UDC  621.396:535.8 

S.Yu.  Bondartsev,  N.A.  Yessikina  and  A.P.  Lavrov,  "Optical  Processors  With 
Charge-Coupled  Photodetectors,"  AVTOMETRIYA,  No  6,  1988. 

Optical  processors  with  scanning  CCD  photodetectors  are  discussed.  The 
processors  are  shown  to  be  a  promising  technology  for  analysis  of  various 
types  of  radio  signals.  A  study  of  an  optical  correlator  with  a  reference 
mask  has  been  conducted.  It  generates  crosscorrelation  functions  of  a  video 
signal  and  a  library  of  reference  signals  written  in  the  mask.  The 
correlator  is  shown  to  be  a  highly  productive  optical  processor  which 
accomplishes  vector-matrix  multiplication.  The  paper  also  describes  an 
acoustooptical  device  processing  signals  of  pulsars  (acoustooptical 
dispersion  compensator) ;  with  the  aid  of  a  scanning  CCD  photodetector  pulsar^ 
signals  are  condensed. 

UDC  681.7.013.82 

V.I.  Kozik,  A.N.  Oparin  and  O.I.  Potaturkin,  "A  Study  of  the  Characteristics 
of  a  Holographic  Correlation  Conjugated  With  a  Video  Processor,"  AVTOMETRIYA, 
No  6,  1988. 

An  optoelectronic  system  is  described  which  consists  of  a  holographic 
intensity  correlator  [GKI]  with  a  quasimonochromatic  CRT  and  a  universal 
digital  video  processor  controlled  by  a  microcomputer  and  coupled  with  the 
correlator  in  input  and  output.  The  influence  is  studied  of  the  parameters 
of  the  optical  GKI  system  and  the  quasimonochromatic  CRT  upon  the  resolution 
and  transmittance  of  the  correlator.  The  recognition  efficiency  of  real 
objects  with  digital  image  preprocessing  algorithms  is  evaluated.  Results  of 
an  experimental  study  are  reported. 

UDC  621.391 

V.G.  Getmanov,  "Reducing  the  Time  of  Solution  of  Discrete  Trigonometric 
Approximations,"  AVTOMETRIYA,  No  6,  1988. 
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Approximation  of  narrow-band  processes  by  a  modeling  sinusoidal  function  is 
studied.  The  residual  sum  of  squares  in  the  frequency  region  is  registered 
by  using  a  discrete  Fourier  transform  of  the  initial  process  and  the  model 
function.  This  helps  reduce  substantially  the  model  parameter  calculation 
time  compared  with  approximations  in  the  time  region. 

UDC  381.3.06 

A. A.  Yantimirov,  "Software  for  Raster  Graphics  Devices,"  AVTOMETRIYA,  No  6, 
1988. 

Algorithms  of  a  standardized  basic  software  package  that  implements  the 
graphics  function  independent  of  the  type  of  hardware  are  described.  The 
program  draws  polygons  and  parts  of  circles,  filling  them  with  patterns. 
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GENERAL 


Soviet  Personal  Computers 

907G0051A  Moscow  NOVOYE  V  ZHIZNI,  NAUKA,  TEKHNIKE:  VYCHISLITELNAYA 
TEKHNIKA  I  EYE  PRIMENENIE  in  Russian  No  11,  Nov  1989  pp  22-29 

[Full  translation  of  article  "Personal  Computers"  by  A.  A.  Stanishevskiy] 

[Text]  The  first  personal  computer,  i.e.,  the  first  personal  microcomputer,  appeared  in  1976  when 
the  first  "Apple"  computer  was  manufactured.  The  birth  of  the  first  microcomputer  was  also 
accompanied  by  the  development  of  a  new  terminology  which  is  often  borrowed  from  the  "big 
brothers"  —  mainframe  and  minicomputers,  and  this  terminology  has  to  some  degree  followed  the 
development  of  the  new  "world"  of  personal  computers. 

One  journal  for  beginner  users  of  computer  technology  has  explained  that  since  the  central  processor 
is  based  on  an  integrated  circuit  —  a  microprocessor  —  the  computer  itself  is  called  a  microcomputer. 

An  article  in  Literaturnaya  Gazeta  on  this  subject  has  noted  quite  accurately  that  in  order  to  be  a 
personal  computer  a  computer  must  be  bought  in  a  store,  brought  home,  and  used  independently. 

However,  it  would  be  more  accurate  to  speak  of  microprocessor  technology  rather  than  the 
microcomputer  since  it  is  microprocessor  technology  that  is  based  on  microprocessors.  Since 
microprocessors  are  installed  and  used  not  only  in  computer  technology  but  also  in  a  wide  variety 
of  domestic  equipment  and  industrial  engineering  a  microprocessor-based  computer  installed  in  such 
equipment  will  be  called  an  embedded  microcomputer.  Such  a  microcomputer  is  characterized  by  a 
design  whereby  both  its  tasks  and  functions  are  predetermined  and  are  limited  to  its  future 
application.  All  required  software  is  "woven"  in  advance  in  the  RAM  of  this  computer  whose  tasks 
involve  performing  any  of  a  variety  of  industrial  processes. 

A  microcomputer  that  is  used  for  domestic  purposes,  for  teaching  students,  for  games,  and  which 
has  a  limited  RAM  that  ordinarily  does  not  exceed  64  Kbytes  without  advance  peripherals  and  is 
contained  within  a  single  unit  plus  a  keyboard  will  be  referred  to  as  a  home  computer.  The 
"Elektronika  BKOOlO",  "Mikrosha"  and  "Radio  RK86"  microcomputers  are  the  domestic  represents 
of  such  computers. 

A  microcomputer  whose  RAM  begins  at  512  Kbytes  and  can  be  expanded  to  1  Mbyte  or  more  and 
which  has  advance  peripherals:  floppy  disk  memories,  "Winchester"  hard  disks  (less  than  10  Mbytes) 
a  printer  and  any  other  advance  peripheral  depending  on  the  field  of  application  will  be  called  a 
professional  personal  computer. 

Of  course  the  definitions  of  microcomputers  given  here  are  rather  arbitrary  since  a  home  computer 
can  also  be  used  to  run  moderately  complex  industrial  tasks,  while  a  professional  personal  computer 
can  be  used  for  training. 

The  "Elektronika  BKOOlO"  domestic  personal  computer. 

The  "Elektronika  BKOOlO"  microcomputer  became  the  first  domestic  personal  computer  that  was  in 
fact  available  for  purchase  in  stores.  Its  popularity  was  reflected  by  the  fact  that  it  could  not  be  kept 
on  store  shelves  and  many  informal  societies  of  users  of  the  "Elektronika  BKOOlO"  microcomputers 
were  formed. 

Of  course  one  of  the  reasons  for  the  polynomial  of  the  "Elektronika  BKOOlO"  computer  was  that  it 
was  the  only  computer  available  in  the  stores  and  it  was  relatively  "inexpensive":  approximately  600 
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rubles.  Another  reason  for  the  popularity  was  its  simple  design:  the  entire  computer  was  housed  in 
a  single  unit  with  a  keyboard  and  its  dimensions  were  such  that  it  could  easily  be  fit  in  a  briefcase. 

The  random  access  memory  in  the  "Elektronika  BKOOlO"  computer  was  16  Kbytes  with  an  additional 
16  Kbytes  used  for  the  screen  memory.  Several  operating  modes  could  be  used  and  the  RAM  could 
be  increased  to  28  Kbytes  to  accommodate  these. 

Such  a  RAM  capacity  is,  of  course,  quite  small,  although  it  does  permit  execution  of  simple  programs 
that  can  be  written  in  BASIC.  It  is  also  possible  to  use  finished  programs  by  inputting  these  to  the 
RAM  from  magnetic  tape.  A  variety  of  graphs  and  figures  can  be  generated  on  the  "Elektronika 
BKOOlO".  Nonetheless  the  RAM  capacity  is  quite  small. 

Therefore  in  future  models  of  the  computer  —  the  "Elektronika  BKOOH"  and  the  "Elektronika 
BKOlOO"  —  the  RAM  capacity  will  be  increased  to  128-256  Kbytes. 

The  "Agat"  personal  computer 

The  "Agat"  personal  computer  is  another  representative  of  domestic  computers  used  for  teaching 
purposes.  A  modular  design  principle  was  used  in  constructing  this  model;  this  design  makes  it 
possible  to  use  various  computer  configurations. 

The  "Agat"  computer  consists  of  a  system  unit,  a  keyboard,  a  video  monitor,  and  a  dot-matrix 
printer. 

The  6502  microprocessor  manufactured  by  MOS  technology  is  used  in  the  "Agat"  computer  as  the 
primary  element  in  the  central  processor.  The  microprocessor  utilizes  8-bit  data.  Its  clock  frequency 
is  1  MHz  with  an  address  bus  word  size  of  16  bits  which  permits  addressing  of  a  64  Kbyte  RAM. 

The  RAM  capacity  in  the  "Agat"  computer  is  32  Kbytes  and  is  expandable  to  256  Kbytes.  The  ROM 
capacity  of  the  computer  is  32  Kbytes. 

The  system  unit  of  the  computer  contains  the  mother  board,  the  auxiliary  modules,  the  power  supply, 
the  floppy  disk  memories,  and  auxiliary  expander  units.  A  maximum  of  seven  modules  can  fit  in  this 
unit. 

The  internal  interface  of  the  "Agat"  computer  is  used  to  interconnect  the  modules.  The  internal 
system  interface  of  the  computer  is  implemented  as  a  set  of  7  60-pin  connectors  on  the  mother  board. 
The  internal  system  interface  is  divided  into  a  data  bus  —  8  lines,  an  address  bus  —  16  lines,  and  a 
control  and  timing  bus. 

The  mother  board  contains  the  central  process  module,  the  RAM  module,  the  floppy  disk  memory 
controller  module,  and  the  serial  and  parallel  interface  modules. 

The  "Agat"  computer  RAM  is  divided  into  256  byte  pages.  The  instruction  set  of  the  central  processor 
is  determined  by  the  6502  microprocessor  which  is  used  to  maintain  program  compatibility  with  the 
Apple  II  and  TRS  80  foreign  personal  computers. 

The  monitor  (display)  of  the  "Agat"  computer  is  based  on  a  "Yunosf  Ts404"  domestic  television 
receiver.  The  unit  contains  an  interface  to  the  "Agat"  computer  and  bypasses  the  RF  circuit.  The 
video  monitor  makes  it  possible  to  generate  the  following  images: 

-  output  of  graphics  data:  black  and  white  image  256  x  256  pixels  and  a  128  x  128  and  64  x  64  pixel 
color  image; 
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-  for  alphanumeric  data  output:  black  and  white  image  of  30  x  64  and  a  color  image  of  32  x  32 
characters. 

Both  a  domestic  tape  recorder  for  cassette  programs  and  ES5088  floppy  disk  memories  manufactured 
by  the  Peoples’  Republic  of  Bulgaria  and  ES5089  manufactured  by  the  German  Democratic  Republic 
are  used  as  the  external  memories  in  the  "Agat"  computer.  The  floppy  disks  are  133  mm  in  diameter 
and  the  write  format  accommodates  140  Kbytes  of  data  per  magnetic  disk. 

The  system  employs  a  DlOO  dot  matrix  printer  which  produces  a  fine  Russian  font  and  Latin  font. 
The  DZM180  and  EPSON- CENTRONIX  printers  can  also  be  used. 

The  computer  keyboard  contains  26  Latin  and  31  Russian  Characters,  10  digit  keys  and  23  control 
keys. 

The  computer  utilizes  the  following  software:  the  "Agat"  DOS  operating  system,  and  a  BASIC 
language  interpreter  which  is  the  Microsoft  dialect  of  BASIC. 

The  program  monitor  is  stored  in  the  ROM  of  the  "Agat"  computer.  The  system  monitor  supports  the 
set  up  of  the  central  processor  operating  modes  ("cold  start")  and  initiation  of  the  software  routines, 
swapping  with  the  magnetic  tape  memories  (domestic  tape  recorder),  control  of  keyboard  commands, 
control  of  data  input  from  magnetic  tape,  acquisition  and  loading  of  programs  from  the  floppy  disk. 
If  no  floppy  disk  is  available  the  system  goes  to  a  "system  monitor  interaction"  mode. 

The  "Agat"  DOS  operating  system  is  designed  to  generate,  track,  and  delete  user  data  sets  (files) 
stored  on  floppy  disks.  The  DOS  can  use  three  types  of  files: 

A  -  BASIC  program; 

B  -  binary  program; 

T  -  text  program. 

The  "Agat"  computer  operating  system  can  support  up  to  10  floppy  disk  memories  (of  the  ES5089 
type). 

The  operating  system  and  software  make  the  "Agat"  computer  an  excellent  tool  for  teaching.  The 
"Agat"  computer  can  also  be  used  for  simple  data  retrieval  purposes. 


DVK:  the  interactive  computer  system 

The  DVK-2  was  one  of  the  first  domestic  personal  computers.  This'  16-bit  computer  was  based  on 
the  K1801  microprocessor.  The  DVK  computers  were  compatible  in  architecture  to  the  SM3  and  SM4 
minicomputers  as  well  as  the  PDF/ 11  minicomputers  of  foreign  manufacture. 

The  primary  specifications  of  the  DVK  personal  computer  are  as  follows: 

data  word  size  -  16  bits; 
basic  RAM  -  64  kbytes. 

The  DVK  contains  two  disk  drives  for  130  mm  floppy  disks  with  a  total  information  capacity  of  2 
X  216  Kbytes.  The  larger  208  mm  disk  drives  can  also  be  used  in  the  DVK  since  floppy  magnetic 
disks  of  this  diameter  have  been  used  in  the  SM4  minicomputer. 
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The  DVK  computer  includes  a  separate  power  supply,  and  a  system  unit  which  contains  the  central 
processor  board,  the  RAM,  the  keyboard,  and  the  display. 

In  the  DVK-2  model  an  alphanumeric  display  (a  character  display)  is  used  and  the  keyboard  contains 
64  keys;  in  the  DVK-3  model  a  graphics  display  with  a  600  x  320  pixel  resolution  is  used  together 
with  a  completely  new  keyboard  expanded  to  1 15  keys  and  is  easy  to  use. 

The  DVK-3  computer  expands  the  DVK  family;  this  computer  has  a  graphics  monochromatic  display 
and  a  new  keyboard  which  has  substantially  expanded  its  possible  applications.  There  is  also  a  DVK- 
4  model  with  a  color  display. 

What  has  been  responsible  for  the  broad  popularity  of  the  DVK  models  in  practice  and  has  at  the 
same  time  limited  their  use?  The  answer  must  be  sought  not  only  in  the  technical  design  of  the  DVK 
series,  although  this  is  important,  but  also  in  the  "software"  product,  which  accompanies  it. 

4 

As  noted  above  the  DVK  computer  models  have  an  architecture  that  is  compatible  with  the  SM4 
series  minicomputers.  Initially  this  permitted  the  use  of  mathematics  designed  for  the  SM  computer. 
The  DVK  series  operating  system  is  the  DVK  OS  which  at  the  same  time  is  an  abbreviated  version 
of  the  RAFOS  operating  system.  Some  utilize  the  RT-11  operating  system  in  its  shortened  version 
in  place  of  the  DVK  operating  system.  By  itself  using  an  identical  operating  system  on  different 
types  of  machines  is  a  positive  aspect  and  supports  compatibility  and  transferability  of  the  programs. 

However,  the  programs  designed  for  the  SM  microcomputer  largely  require  external  hard  disk 
memories,  i.e.,  a  significantly  larger  external  memory  capacity  and  a  larger  RAM  capacity.  Therefore 
for  the  SM  computer  the  RAM  capacity  is  at  least  128  Kbytes. 

By  itself  the  nucleus  of  the  operating  system  occupies  a  comparatively  small  volume  of  the  RAM, 
although  the  DVK  operating  system  is  designed  so  that  the  unnecessary  modules  in  the  operating 
system  are  loaded  when  needed  which  is  rather  difficult  with  64  Kbytes  of  DVK  RAM  which  must 
store  the  nucleus  of  the  operating  system  and  the  user  programs  simultaneous. 

Large  program  packages  that  require  substantial  RAM  space  cannot  be  used  on  the  DVK  computer. 
When  an  external  memory  is  used  the  nucleus  and  the  utilities  of  the  operating  system  occupy 
virtually  the  entire  space  on  one  diskette  (this  also  makes  it  impossible  to  use  all  the  capabilities  of 
the  operating  system),  while  nearly  all  the  space  on  the  second  diskette  is  used  for  holding  the 
language  to  which  the  editor  translates,  the  programming  language  translator,  the  linkage  editor,  and 
the  file  of  object  modules  of  standard  functions.  It  is  difficult  to  fit  this  entire  set  on  one  diskette. 
It  is  also  necessary  to  rearrange  the  diskettes  as  space  must  be  found  for  storing  new  and  existing 
programs. 

FORTRAN,  PASCAL,  and  BASIC  are  among  the  programming  languages  that  have  found  broad 
application  on  the  DVK  computer.  The  DVK  employs  an  RDB-microrelational  data  base.  LISP  and 
PROLOG  artificial  intelligence  packages  have  been  developed  together  with  graphics  packages  for 
PASCAL  and  PROLOG  operation.  All  these  capabilities  permit  use  of  the  DVK  computer  for  a  broad 
range  of  applications. 


"Elektronika  MS  0585"  personal  computer  system 

The  designers  of  the  "Elektronika  MS  0585"  personal  computer  designed  the  unit  for  operation  in 
automated  manufacturing  management  systems,  scientific  and  engineering  automation  systems  for 
producing  electronic  equipment  and  for  processing  industrial  economic  and  statistical  data. 
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This  computer  system  has,  like  all  modern  personal  computers,  a  modular  design  principle.  All  its 
functional  units  are  complete  devices  (modules)  connected  by  means  of  a  communications  cable  or 
a  system  bus. 

All  functional  units  of  the  "Elektronika  MS  0585"  computer  are  contained  within  the  single  housing 
of  the  system  unit.  The  system  unit  has  a  simple  internal  architecture.  It  contains  four  complete 
functional  units.  This  includes  a  power  supply  with  a  fan,  a  hard  disk  memory,  a  double  floppy  disk 
memory,  and  an  auxiliary  module  set.  All  four  units  are  connected  to  the  same  metallic  chassis.  The 
system  board  is  also  on  the  chassis  and  can  be  removed  using  a  lever  (the  board  can  be  extracted 
from  the  side  of  the  system  unit. 

The  layout  of  all  components  provides  easy  access  for  simple  computer  assembly  and  disassembly. 
An  additional  "Winchester"  memory  can  be  used  in  place  of  the  twin  floppy  disk  memories. 

One  feature  of  the  "Elektronika  MS  0585"  computer  is  the  system. module  which  uses  a  separate 
board.  The  system  module  contains:  the  central  processor,  the  RAM  dynamic  distribution  control, 
the  RAM  (512  Kbytes),  the  ROM  which  employs  page  organization  (4  pages  of  4  Kbytes  each),  the 
keyboard  controller,  the  peripheral  controller,  and  the  50  byte  RAM  timer.  All  devices  in  the  system 
module  are  independent  and  interconnected  by  a  bidirectional  internal  bus. 

The  system  bus  is  used  to  establish  a  high  speed  link  between  the  central  processor  and  the 
peripherals.  The  system  bus  is  connected  to  the  "Winchester"  hard  disk  memory  controller,  the  floppy 
disk  controller,  the  video  controller,  and  the  video  controller  memory  expander. 

The  screen  memory  is  32  Kbytes  and  can  be  expanded  to  96  Kbytes  by  using  the  expander.  The 
floppy  disk  memory  employs  133  mm  diskettes  and  the  recording  format  permits  storage  of  up  to 
400  Kbytes.  The  information  capacity  of  the  hard  disk  is  5  or  10  Mbytes  depending  on  configuration. 

The  "Winchester"  hard  disk  makes  the  "Elektronika  MS  0585"  unit  a  professional  computer.  The  RAM 
capacity  is  expandable  to  4  Mbytes. 

The  16-bit  address  bus  used  in  the  computer  makes  it  impossible  to  address  more  than  64  Kbytes  of 
RAM.  However,  the  memory  dispatcher  can  be  used  to  address  up  to  4  Mbytes  and  also  supports 
simultaneous  execution  of  several  processes,  i.e.,  multiprogram  operation.  The  memory  dispatcher 
permits  execution  of  tasks  regardless  of  their  location  in  the  RAM.  Each  task  is  run  in  its  own  virtual 
space  of  64  Kbytes  while  the  memory  dispatcher  provides  a  reference  between  the  logic  addresses 
and  the  physical  addresses. 

The  computer  central  processor  is  based  on  the  K181 1  microprocessor,  set.  The  "Elektronika  MS  6105" 
black  and  white  graphics  monitor  is  used  as  the  display.  The  graphics  appear  on  screen  in  dimensions 
of  640  X  320  pixels  in  16  gray  scale  gradations. 

The  system  bus  supports  two  data  swapping  modes:  program  swapping  and  program  interrupt 
swapping.  The  program  data  swapping  mode  supports  data  transmission  by  request  and  under 
program  control.  The  interrupt  data  swapping  mode  runs  programs  via  a  peripheral  program.  In  this 
mode  the  central  processor  interrupts  the  execution  of  the  present  program,  and  stores  its  present 
state  in  order  to  handle  the  device  from  which  the  interrupt  has  been  sent. 

The  power  switch  is  located  on  the  front  panel  of  the  system  unit  which  is  easy  to  use,  although  there 
is  no  "hot"  start  key,  i.e.,  there  is  no  way  to  reload  the  operating  system  without  turning  the  computer 
off. 

The  "Elektronika  MS  0585"  computer  software  includes  the  PROS  or  FODOS  operating  system. 
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The  PROS  operating  system  is  convenient  for  the  nonprofessional  user  since  it  interacts  with  the  user 
through  a  menu  system.  PROS  is  used  for  bootstrapping  and  execution  of  applied  service  programs, 
supports  the  operation  of  the  peripherals  and  the  entire  computer  as  well  as  multiprogram  operation. 
PROS  includes  a  BASIC  language  interpreter  and  a  text  editor. 

The  test  software  includes  test  programs  for  the  primary  computer  components.  The  computer  is 
tested  when  the  operating  system  is  loaded.  A  cross-section  image  of  the  computer  appears  on  the 
monitor  screen  and  when  one  of  the  units  is  shaded  this  indicates  that  the  unit  has  failed. 

The  FODOS  operating  system  supports  all  the  computer  devices.  FODOS  contains  ASSEMBLER  and 
FORTRAN  program  languages. 

Since  the  instruction  set  of  the  central  processor  of  the  "Elektronika  MS  0585"  is  identical  to  that  of 
the  SM4  and  SMI 420  minicomputers,  the  same  software  can  be  run-on  the  former  as  on  the  latter. 
In  principle  the  "Elektronika  MS  0585"  is  an  SMI 420.  Hence  the  spectrum  of  the  software  is  quite 
broad.  This  includes  the  RTll,  RSX  IIM,  OS,  RAFOS,  RV,  and  DEMOS  operating  systems  and  the 
applied  program  products  controlled  by  these  operating  systems. 

For  example,  the  DEMOS  operating  system  transforms  the  "Elektronika  MS  0585"  personal  computer 
into  a  powerful  tool  which  utilizes  the  C,  and  PASCAL  high  level  programming  languages  as  well 
as  an  advanced  catalog  system  and  the  RUBEN  high  level  relational  data  base.  The  latter  interacts 
with  other  computers  by  means  of  the  KERMIT  package. 


The  "Neuron  66"  personal  computer 

The  designers  gave  this  somewhat  poetic  name  to  their  personal  computer.  It  should  be  pointed  out 
immediately  that  this  computer  has  no  relation  to  neural  networks  or  computers  based  on  the 
principles  of  neural  nets. 

Nonetheless  the  "Neuron  66"  can  claim  the  name  of  "personal  computer".  Like  all  modern  computers 
it  is  based  on  a  modular  design  principle,  and  has  a  modern  operating  system.  The  "Neuron  66" 
computer  is  designed  to  automate  the  collection,  processing,  and  documentation  of  information  as 
an  intelligent  controller  in  computerized  instrument  systems  for  controlling  radio  instruments.  And, 
of  course,  the  "Neuron  66"  is  designed  for  automation  of  individual  efforts  in  design  and  scientific 
and  technical  activities. 

The  modular  design  principle  where  each  module  is  a  complete  device  makes  it  possible  to  easily 
construct  different  versions  of  the  "Neuron  66"  personal  computer  system.  Its  base  units  include: 

-  the  controller  (system  unit)  which  contains  the  system  microprocessor  board,  the  RAM  and  ROM, 
and  the  power  supply; 

-  the  floppy  disk  memory; 

-  the  "Winchester"  hard  disk  memory; 

-  the  video  monitor; 

-  the  keyboard. 

A  printer  and  a  plotter  can  be  connected  to  the  controller  depending  on  the  configuration. 
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A  KP  1810  VM86  16-bit  microprocessor  is  used  to  construct  a  central  processor  as  the  base  element. 
The  RAM  capacity  can  vary  from  256  to  520  Kbytes.  The  recording  format  yields  a  640  Kbyte 
floppy  disk  capacity.  The  data  capacity  of  the  hard  disk  is  5  Mbytes. 

The  system  employs  the  same  "Elektronika  MS  6105"  black  and  white  display  as  the  DVK  3M, 
"Elektronika  MS  0585"  and  ESI 840  personal  computers.  The  display  produces  a  640  x  200  pixel  mode 
in  the  graphics  mode  and  generates  25  lines  of  80  characters  each  in  the  alphanumeric  mode. 

The  computer  utilizes  a  C2  interface.  The  front  panel  of  the  controller  has  a  "reset"  button  which 
makes  it  possible  to  reload  the  operating  system  in  the  event  of  any  difficulties  (a  computer  "hang¬ 
up")  without  shutting  the  computer  down. 

The  software  for  the  "Neuron  66"  personal  computer  consists  of  system  software,  applied  software, 
and  test  software. 

The  selection  of  system  software  for  the  "Neuron  66"  is  rather  large.  Such  software  includes  the 
"Neuron  DOST',  the  "Neuron  DOS2"  systems,  and  there  is  also  program  compatibility  with  the  popular 
CPM/86  and  MSDOS  operating  systems.  It  is  clear  from  a  comparison  of  the  instruction  sets  of  the 
operating  systems  that  the  operating  systems  of  the  "Neuron  66"  computer  represents  their  analogs. 
This  makes  it  possible  to  use  the  same  software  as  used  with  MSDOS  and  CPM/86.  This  would  all 
seem  fine  although  certain  difficulties  remain. 

The  floppy  disk  write  format  used  in  MSDOS  is  a  360  Kbyte  format  which  was  also  well  understood 
by  the  designers  of  the  ESI 840  personal  computer.  The  incompatibility  of  the  floppy  disk  format 
causes  certain  difficulties,  unnecessary  time  losses  in  transferring  information  from  one  operating 
system  to  another,  and  from  one  computer  to  another.  The  512  Kbyte  RAM  and  even  more  so  the 
260  Kbyte  RAM  cannot  handle  many  modern  applied  program  packages.  The  standard  minimum 
RAM  capacity  today  is  640  Kbytes  and  the  designers  oEmodern  software  design  for  this  capacity. 

It  is  also  to  use  the  capabilities  provided  by  the  "Neuron  66"  personal  computer  which  includes  the 
BASIC  and  PASCAL  high  level  programming  languages. 


The  "Iskra  1030"  personal  professional  computer 

The  "Iskra  1030"  is  one  additional  personal  computer  that  can  fill  the  gap  where  a  minicomputer  or 
mainframe  computer  is  either  not  suitable  or  is  too  expensive. 

The  designers  of  the  "Iskra  1030"  personal  computer  designed  the  unit  for  automatic  execution  of  a 
wide  range  of  economic,  administrative  and  management  routines. 

The  modular  design  principle  was  also  used  in  the  "Iskra  1030"  personal  computer,  thereby 
simplifying  its  outfitting  and  the  connection  to  various  auxiliaries. 

The  basic  setup  of  the  "Iskra  1030"  computer  includes  five  functional  units  (modules):P 

-  processor  module  (system  unit); 

-  data  display  module  (display); 

-  keyboard  module; 
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-  printer; 

-  distributor  filter  unit. 

We  are  already  familiar  with  the  first  four  modules.  The  distributor  filter  unit  is  designed  to  filter 
the  standard  supply  voltage  and  the  voltage  to  the  first  four  modules. 

The  system  unit  (processor  module)  contains  virtually  all  primary  computer  components  within  a 
single  housing: 

-  the  system  board  containing  the  microprocessor; 

-  the  RAM  and  data  display  module  controllers; 

-  the  133  mm  floppy  disk  controllers; 

-  the  "Winchester"  hard  disk  controller; 

-  the  data  transmission  system  adapter. 

Aside  from  these  sections  the  system  unit  contains  the  floppy  disk  drives,  the  "Winchester"  hard  disk 
drive  and  the  power  supply. 

The  system  unit  therefore  is  rather  large  in  size:  485  x  455  x  200  and  it  was  then  decided  to  install 
all  primary  computer  components  within  a  single  unit  which  was  not  a  design  choice  for  the 
designers  of  the  ESI  840  and  "Neuron  66"  computers. 

The  K1810  VM86  microprocessor  used  in  the  "Iskra  1030"  operates  at  a  clock  frequency  of  4.77  MHz 
with  16-bit  data  while  a  20-bit  address  bus  is  used  for  addressing.  The  "Iskra  1030"  RAM  has  a  512 
Kbyte  capacity.  It  is  true  that  the  RAM  capacity  may  be  256  Kbytes  or  1  Mbyte  depending  on  the 
configuration. 

An  "Elektronika  MS  6105"  black  and  white  paphics  display  is  used  as  the  display;  the  video 
controller  of  this  unit  has  the  following  operating  modes: 

-  in  the  text  mode:  25  lines  of  80  characters  each  or  25  lines  of  40  characters  each; 

-  in  the  graphics  mode:  640  x  200  and  320  x  200  pixels. 

The  capacity  of  the  floppy  disks  used  in  the  "Iskra  1030"  computer  is  360  Kbytes  corresponding  to 
the  IBM  PC  standard.  The  "Winchester"  hard  disk  has  a  5  Mbyte  capacity. 

However  compatibility  with  the  IBM  PC  format  is  possible  if  the  floppy  disk  is  formatted  for  40 
tracks  corresponding  to  180  Kbytes.  This  format  which  makes  it  possible  to  write  360  Kbytes  of 
information,  while  corresponding  in  size  to  the  IBM  PC  standard,  is  in  fact  not  compatible  with  this 
standard  since  the  floppy  disks  in  the  "Iskra  1030"  computer  are  formatted  on  one  side  while  in  the 
IBM  PC  they  are  formatted  on  both  sides.  Therefore  the  floppy  disk  format  selected  and  imple¬ 
mented  by  the  designers  of  the  "Iskra  1030"  machine  provide  the  user  with  many  "free"  minutes 
during  operation.  This  is  particularly  noticeable  when  transferring  programs  from  the  IBM  PC  or 
from  the  ES1840  or  the  "Neuron  66"  computers. 

We  will  examine  this  issue  in  somewhat  greater  detail.  In  order  to  read  IBM  PC-formatted  disks  (340 
Kbytes)  on  the  "Iskra  1030"  they  must  be  formatted  by  a  format/ 1  command,  which  distributes  40 
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tracks  on  one  side  of  the  floppy  disk.  Then  the  information  written  on  one  two-sided  magnetic  disk 
must  be  rewritten  on  two  single-sided  40-track  diskettes  on  a  computer  utilizing  the  two-sided 
floppy  disk  memories. 

The  user  can  then  choose  to  employ  either  single-sided  40-track  magnetic  disks  or  to  rewrite 
information  from  these  two  disks  onto  one  single-sided  80-track  disk  on  the  "Iskra  1030"  which  is 
done  by  means  of  a  COPY  command. 

However  there  is  one  more  surprise  when  the  user  finds  out  that  since  the  ADOS  operating  system 
of  the  "Iskra  1030"  is  compatible  with  MSDOS  it  considers  the  single-sided  80-track  magnetic  disks 
to  be  two-sided  40-track  disks.  This  is  because  the  track  carrying  even  numbers  is  read  logically  by 
the  operating  system  as  side  0  of  a  two  sided  diskette  while  the  track  with  odd  numbers  is  read  as 
side  1. 

It  is  interesting  to  note  that  the  designers  decided  to  mount  the  disk  drives  for  the  floppy  disks 
vertically  rather  than  horizontally.  Users  can  decide  whether  or  not  this  is  convenient  by  using  the 
computer. 

It  is  true  that  another  design  is  quite  convenient:  the  decision  by  the  designers  to  place  on  the  front 
panel  of  the  display  the  brightness  and  contrast  controls,  and  especially  the  volume  control  which 
fans  of  computer  games  consider  necessary.  The  power  switch  is  located  on  the  back  panel  of  the 
system  unit  while,  unfortunately,  there  are  no  operating  system  reset  or  reload  switches. 

The  data  transmission  adapter  supports  interfacing  with  other  computers  on  communications  channels 
through  connectors  satisfying  the  radial  serial  interface  standards,  as  well  as  a  C2  connector.  The  data 
transmission  adapter  supports  communications  on  telephone  channels  by  means  of  a  signal  converter 
(a  modem). 

The  "Iskra  1030"  personal  computer  can  support  a  coprocessor  based  on  the  K1810  VM87  micropro¬ 
cessor  for  performing  floating  point  arithmetic  operations. 

The  8  Kbyte  computer  ROM  consists  of  four  2  Kbyte  integrated  circuits.  The  ROM  contains  the 
initialization  program  for  the  operating  system  bootstrap  routines.  The  ROM  also  contains  the  drivers 
(control  programs)  for  the  peripherals.  The  bootstrap  routine  also  tests  all  devices  in  the  personal 
computer. 

The  basic  software  for  the  "Iskra  1030"  computer  consists  of: 

-  the  operating  system; 

-  the  ADOS  programming  language; 

-  the  MASM  programming  language; 

-  the  BASIC  programming  language;  and 

-  the  YaMB  (accounting  machine  language)  programming  language. 

The  ADOS  operating  system  is  an  analog  of  the  MSDOS  operating  system.  The  instruction  mnemonics 
of  the  ADOS  operating  system  are  compatible  with  the  MSDOS  instruction  mnemonic.  The 
architecture  of  ADOS  is  analogous  to  that  of  MSDOS.  The  MASM  macroassembler  is  designed 
primarily  for  system  programmers  that  cannot  imagine  writing  programs  in  any  language  other  than 
ASSEMBLER.  The  YaMB  programming  language  is  designed  for  economic  and  statistical  applica- 
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tions.  It  is  used  in  the  software  simply  to  provide  program  compatibility  and  to  support  the  programs 
that  were  written  for  the  "Iskra  555"  and  the  "Neva  501"  accounting  machines. 


The  SM  1800  personal  computer  family 

The  SM  1800  personal  computer  family  includes  the  following  models:  The  SM  1800,  SM  1804,  SM 
1810,  and  SM  1814.  The  SM  1804  and  SM  1814  computer  models  are  dust  resistant  versions  of  the 
computer. 

The  SM  1800  personal  computer  is  based  on  the  8-bit  KR580  microprocessor  set.  The  modularity 
principle  underlies  the  design  of  the  SM  1800  computer  family,  as  with  the  computers  examined 
above.  The  141  universal  system  interface  and  the  modularity  of  the  architecture  make  it  possible  to 
integrate  the  modules  and  the  system  on  the  printed  circuit  board,  which  provides  significant 
flexibility  in  the  design  of  control  systems  and  computer  systems  and  also  provides  expandability  of 
the  computer  as  a  system.  The  expandability  property  can  be  used  for  further  expansion  and 
development  of  the  oyerall  system. 

A  new  component  base  -  the  KR  1810  microprocessor  set  -  was  used  for  the  SM  1810  computer;  this 
microprocessor  set  utilizes  16-bit  data.  The  expandability  and  modular  design  of  the  architecture 
made  it  possible  to  use  virtually  all  modules  from  the  SM  1800  in  the  SM  1810  computer  model 

The  SM  1800  computer  architecture  can  support  simultaneous  operation  of  up  to  8  central  processor 
modules,  each  of  which  contains  a  microprocessor,  a  local  RAM  and  ROM,  and  a  connection  to  the 
141  universaf  interface.  Up  to  16  such  modules  can  be  used  in  the  SM  1810  computer. 

The  K  1810  VM86  microproceissor  operates  at  a  4.77  MHz  clock  frequency  and  can  address  up  to  1 , 
Mbyte  of  RAM.  A  coprocessor  for  floating  point  arithmetic  operations  can  be  used  in  the  SM  1810 
computer. 

The  microprocessor  system  based  on  the  K  1810  microprocessor  set  and  the  I4l  universal  interface 
has  the  following  advantages  over  a  single  process  system: 

-  a  system  task  is  divided  into  a  number  of  subtasks,  each  of  which  is  executed  by  its  own  dedicated 
microprocessor; 

-  several  microprocessors  can  operate  simultaneously,  thereby  increasing  the  operating  speed  of  the 
overall  system. 

The  address  bus  in  the  SM  1810  computer  has  been  expanded  from  20  bits  (SM  1800)  to  24  bits 
which  makes  it  possible  to  address  up  to  16  Mbytes  of  RAM.  The  majority  of  modules  from  the  SM 
1800  can  be  used  in  this  personal  computer.  The  SM  1810  computer  contains  an  auxiliary  8 -bit 
module  to  establish  compatibility  between  these  modules. 

The  SM  7209  and  VTA22000  terminals  are  used  as  displays;  these  support  operation  in  an  alphanu¬ 
meric  mode  producing  25  x  80  characters. 

It  should  be  noted  that  there  is  a  broad  variety  of  software  in  the  SM  1800  computer  family.  This 
software  includes  a  general  purpose  operating  system,  a  resource  (training)  system,  real-time 
operating  systems,  and  the  BASIC  software  package. 

The  OS  1800  general  purpose  and  micro-DOS  operating  systems  contain  a  rather  broad  set  of 
programming  packages.  These  packages  are  compatible  with  the  CP/M  operating  system.  They 
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consist  of  the  basic  disk  operating  system  (BDOS),  the  basic  input-output  system  (B  I/Os)  and  the 
console  command  processor. 

The  resource  software  is  designed  to  develop  complex  programs,  largely  for  real-time  systems.  These 
include  the  DOS  1800  and  DOS  1810  systems.  The  programming  languages  here  are  BASIC, 
FORTRAN,  and  PASCAL.  The  BASIC  real-time  system  supports  multiprogram  operation. 

The  Mikros-86  operating  system  which  is  compatible  with  the  CP/M-86  operating  system,  is  used. 

The  user  has  access  to  a  wide  range  of  program  packages.  The  "Dialog"  package  makes  it  possible  to 
develop  and  utilize  an  information  retrieval  system  and  to  produce  data  bases.  The  "data  preparation" 
package  supports  data  processing  in  documents  in  any  format.  The  BnD  (data  base)  package  is  used 
to  produce  relational  data  bases.  The  "Kalendar"  package  is  used  to  supervise  management  functions. 


The  YeS  series  personal  computers 

It  is  likely  that  everyone  is  familiar  with  the  mainframe  YeS  computers.  This  family  has  also  been 
expanded  to  include  new  personal  computers.  The  YeS  1840  professional  computer  provided  access 
to  the  rich  existing  library  of  programs. 

Structurally  the  YeS  1840  computer  system  consists  of  several  functional  modules.  Above  all  this 
includes  the  system  electronic  unit,  the  floppy  disk  set,  the  display  (a  black  and  white  graphics 
display  in  the  most  recent  models)  and  a  keyboard. 

The  system  unit  is  150  x  300  x  450  mm  in  size.  It  contains  the  basic  system  board,  the  RAM  board, 
the  display  address  board,  the  floppy  disk  drive  adapter  board,  and  the  interface  adapter  board.  The 
base  system  unit  also  contains  the  power  supply  and  a  fan.  Additional  peripheral  adapter  boards  or 
an  additional  RAM  expander  board  can  be  connected  to  the  free  connectors  on  the  system  bus 
(acquiring  such  boards  is  rather  problematical). 

The  expander  unit  is  included  in  order  to  support  a  large  number  of  peripherals.  The  expander  unit 
is  based  on  the  same  structural  design  as  the  system  unit.  It  has  its  own  power  supply  and,  like  the 
system  unit,  has  a  system  bus  and  seven  connectors.  It  is  true  that  there  are  only  six  free  connectors 
since  one  is  assigned  for  connection  to  the  system  unit. 

The  "heart"  of  the  system  board  is  the  K  1810  VM86  microprocessor  which  can  handle  8-  and  16- 
bit  data  (an  analog  of  the  Intel  8088  microprocessor)  and  can  address  a  memory  of  up  to  1  Mbyte  by 
means  of  a  20-bit  address.  However  the  RAM  capacity  available  in 'the  basic  configuration  is  only 
512  Kbytes  (you  will  soon  find  out  whether  or  not  this  is  enough  when  using  the  computer).  The 
microprocessor  contains  a  6  byte  buffer  to  increase  information  handling  speed,  which  makes  it 
possible  to  select  the  next  commands  while  the  microprocessor  is  handling  the  present  command. 

The  computer  ROM  has  an  8  Kbyte  capacity.  Computer  test  programs  are  contained  here;  these  are 
executed  when  the  computer  is  turned  on.  The  disk  drives  used  in  the  YeS  1840  are  designed  for  130 
mm  floppy  disks  and  write  in  the  360  Kbyte  format,  i.e.,  2  x  360  Kbytes.  This  is,  of  course,  small 
and  therefore  a  program  is  available  to  expand  the  information  capacity  of  each  floppy  disk  to  720 
Kbytes.  This  is  possible  as  each  magnetic  disk  is  considered  to  be  two  disks,  which  maintains 
compatibility  with  the  IBM  PC  standard. 

In  the  early  YeS  1840  models  an  alphanumeric  display  was  used  to  produce  25  lines  of  80  characters 
each.  However  it  was  subsequently  replaced  by  a  black  and  white  graphics  display  (the  same  unit 
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used  in  the  DVK-3  and  in  the  "Elektronika  MS  0585").  The  operating  modes  make  it  possible  to 
obtain  up  to  16  gray  shades  and  a  maximum  resolution  of  640  x  200  pixels. 

The  YeS  1840  keyboard  employs  a  planar  design  structure  containing  92  keys  with  a  key  configura¬ 
tion  similar  to  that  of  the  IBM  PC. 

The  floppy  disks  are  housed  in  a  separate  unit  which  of  course  complicates  the  layout  of  the 
computer  on  a  desk.  The  unit  contains  two  stacked  memories.  An  independent  power  supply  is  used 
so  that  in  switching  on  the  computer  the  wisdom  of  the  designers  is  often  recognized  since  the  power 
supply  switches  are  located  on  the  rear  panel  of  both  the  system  unit  and  the  memories.  No  hot  start 
key  exists  either.  In  most  cases  it  is  possible  to  reload  the  operating  system  by  simultaneously  keying 
the  CON/ ACC/REM  keys  [UPR/DOP/UDL]  although  it  is  possible  to  determine  soon  enough 
whether  or  not  this  is  insufficient. 

One  additional  inconvenience  is  the  design  of  the  cables  and  connectors  used  in  the  computer.  A 
screwdriver  is  reQuired  to  connect  different  units  while  interconnecting  cables  are  reQuired  to  cover 
the  substantial  space  for  containing  such  units. 

Auxiliary  modules  have  been  developed  to  expand  the  capabilities  of  the  YeS  1841.42  computer:  a 
discrete  signal  input-output  module,  an  analog  signal  input-output  module,  a  CAMAC  system 
interface  module,  and  a  digital  speech  synthesizer. 

An  "Estafeta"  network  adapter  and  a  YeS  7920  system  adapter  module  have  been  developed  for 
establishing  local  area  networks.  The  C2  connection  adapter  (an  analog  of  the  RS232  permits 
connection  to  a  remote  data  processing  network  based  on  modules  of  the  mainframe  YeS  computers 
through  multiplexers  and  modems.  The  YeS  7920  system  adapter  will  ppport  connection  of  the  YeS 
family  personal  computer  to  the  7920  system  as  one  of  the  terminals  of  this  system,  thereby 
providing  access  to  the  resources  of  a  mainframe  YeS  computer.  The  adapter  of  the  "Estafeta"  local 
area  network  can  be  used  to  set  up  ring-type  local  area  regional  computer  networks.  The  "Estafeta" 
network  can  interconnect  up  to  125  personal  computers  with  a  maximum  distance  of  up  to  125  km 
between  computers. 

So  you  have  already  installed  a  YeS  1840  personal  computer  (and  possibly  an  1842)  at  your 
workstation,  have  switched  on  the  power  and  have  started  working.  What  basic  operating  system  are 
you  to  use?  The  manufacturer  has  provided  with  the  standard  YeS  computer  set  the  M86  operating 
system  and  a  set  of  applied  program  packages.  The  commands  in  the  M86  operating  system  have 
Russian  notation  and  the  program  packages  will  "speak"  with  you  in  Russian.  It  is  probably  best  if 
you  carefully  review  program  documentation  at  a  convenient  site  with  the  diskettes  on  which  the 
software  is  written.  You  can  then  acquire  the  MSDOS  from  your  friends  or  through  official  channels. 
You  ask  why? 

Such  a  favorable  operating  system  as  the  M86  is  a  close  brother  to  the  CP/M- 80  operating  system. 
However,  this  is  in  the  past. 

Program  compatibility  with  the  IBM  PC  and  its  MSDOS  operating  system  in  fact  will  make  the  YeS 
1840  such  a  powerful  tool  that  it  becomes  indispensable.  The  MSDOS  will  allow  the  user  access  to 
all  such  software  currently  available.  You  can  use  such  modern  program  interfaces  as  PCTOOLS  and 
NORTON  COMMANDER.  You  will  have  access  to  a  wide  variety  of  modern  programming  languages 
from  BASIC  through  PROLOG,  data  base  management  systems  and  integrated  tools  packages.  So  do 
not  waste  your  time  on  M86  and  begin  immediately  with  MSDOS. 

One  additional  comment.  The  capabilities  provided  by  modern  operating  systems  such  as  MSDOS  and 
their  software  require  substantial  RAM  capacity  and  a  hard  disk.  Therefore  you  will  soon  find  out 
that,  unfortunately,  the  YeS  1840  computer  does  not  permit  utilization  of  all 
tools  provided  by  the  software. 
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GENERAL 


THE  VECTOR  PROCESSOR  OF  THE  ELBRUS -2  MULTIPROCESSOR  COMPUTING  COMPLEX 
907G0089A  Moscow  SUPER-EVM  in  Russian,  pp.  1-29 

[Article  by  V.  S.  Burtsev,  Ye.  A.  Krivosheyev,  V.  D.  Asriyeli,  P.  V.  Borisov 
and  K.  Ya.  Tregubov] 

[Text]  1.  Principles  of  Design  of  the  Vector  Processor  and  Its  Operation 
Within  the  Elbrus -2  MCC 

Development  of  the  Elbrus -2  MCC  was  mainly  oriented  at  achieving  utmost 
performance  of  the  complex  in  solving  complicated  scientific  data  processing 
and  computing  tasks  in  a  collective  use  modality  and  real  time  problems .  The 
work  was  carried  out  in  the  following  directions: 

-  achieving  utmost  performance  of  the  universal  central  processors  by  having  a 
parallel  (in  time)  computing  process  by  dynamic  allocation  of  resources  such 
as  actuating  devices,  the  fast  registers  of  the  scratch-pad  memory,  and  so 
forth; 

-  use  of  a  modular  multiprocessor  architecture  with  dynamic  allocation, 
especially  allocation  of  the  modules  of  central  processors,  memory,  I/O 
processors ,  and  exchange  channels  with  the  external  memory  and  the  external 
device  modules ; 

-  widespread  use  of  specialized  processors  in  the  multiprocessor  system, 
including  first  and  foremost  a  vector  processor. 

In  order  to  understand  the  features  of  using  specialized  vector  processors 
alongside  the  universal  central  processors  in  a  multiprocessor  architecture, 
let  us  consider  the  basic  principles  of  their  functioning,  especially  as  it 
enables  a  faster  computing  process. 

It  must  be  mentioned  that  a  specialized  vector  processor  is  capable  of 
hastening  a  computing  process  that  represents  a  parallel  graph  of 
non- interconnected  homogeneous  operations  on  data.  Any  given  operation  on 
data  (addition,  multiplication,  division,  etc.)  can  be  broken  up  into  an 
interrelated  sequence  of  smaller  operations ,  such  as  subtraction  of  exponents , 
matching  of  exponents,  taking  the  reciprocal,  obtaining  the  half  sum  and 
carry,  and  other  microoperations.  If  the  operations  are  not  interrelated  in 
data,  e.g.,  if  it  is  required  to  add  vectors  component  by  component,  before 
the  operations  on  the  first  paid  of  vector  components  are  completed  and  after 
finishing  a  particular  microoperation  it  is  possible  to  begin  the  addition  of 
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the  next  pair  of  vector  components,  and  so  forth.  If  the  operation  of 
addition  is  broken  up  into  N„op  consecutive  microoperations,  the  times  of 
execution  of  which  (r^^p)  are  equal  to  each  other,  then  after  a  time  ^mop  ^  ^mop 
all  the  devices  performing  the  microoperations  will  be  engaged  in  working  on 
the  vector  components.  In  this  case,  the  speed  of  the  adding  device  will  be 
N^op  faster  than  that  when  working  with  single  interrelated  data. 

The  effect  of  the  possible  increase  in  speed  when  working  with  vectorized  data 
may  be  augmented  by  considering  simultaneous  execution  of  not  one,  but  several 
different  (including  data- interrelated)  operations  on  vectors.  Thus,  for 
example,  if  it  is  needed  to  perform  the  following  sequence  of  operations  on 
vectors : 


A  X  B  +  C  X  D.^ 

R  ;  =  E 

we  may  organize  a  vector  processing  "pipeline,"  as  shown  in  Fig.  1.  In  this 
case ,  all  the  devices  performing  microoperations  will  be  full  after  a  time 
20  Tnjo  .  It  is  assumed  that  the  multiplication  device  performs  its  operation 
in  six,  the  addition  device  in  four,  and  the  division  device  in  ten  r^^p,  and 
all  of  these  have  the  appropriate  number  of  consecutively  working  microopera¬ 
tion  blocks.  The  thus -organized  vector  processor  in  the  above  example  will  be 
26  times  faster  than  an  ordinary  scalar  processor,  consecutively  handling  the 
operations  by  the  identical  formula: 


a  X  b  +  c  X  d^ 

r  :  =  e 


X 


Fig.  1.  A  vector  processing  pipeline. 


Even  if  the  scalar  processor  has  two  multiplication  devices  and  the  multipli¬ 
cation  operations  are  performed  at  the  same  time,  its  speed  will  still  be 


^R,  A,  B,  C,  D,  E  -  vectors  having  n  components.  Here  and  below,  component 
operations  with  vectors  are  understood. 

^r,  a,  b,  c,  d,  e  -  scalars. 
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20  times  less  than  that  of  the  vector  processor.  However,  to  achieve  such 
performance,  the  vector  processor  must  have  a  strictly  (accurate  down  to  r) 
s3mchronized  input  of  operands  and  the  devices  for  performing  the  computations 
must  be  free  at  the  required  time.  Satisfaction  of  these  demands  is  achieved 
by  total  monopolization  of  the  resources  of  the  processor  and  working  storage 
on  behalf  of  solving  the  single  problem.  Thus,  simultaneous  operation  of 
several  processors  with  a  shared  working  storage  and  multiprogram  operation 
with  dynamic  distribution  of  the  resources  in  the  course  of  solving  a  problem 
are  excluded.  In  such  conditions,  when  using  a  wide-format  instruction  that 
actually  consists  of  several  operations,  it  is  possible  to  perform  a  (static) 
distribution  of  the  resources  over  time  during  the  process  of  translation  with 
accuracy  down  to  an  interval  in  order  to  evaluate  different  vector 

expressions,  hereafter  known  as  macrobased  crosses. 

Inevitable  equipment  standstills  occur  when  switching  from  the  running  of  one 
macrobased  cross  to  another.  Therefore,  the  longer  the  vectors,  the  fewer 
such  losses.  Thus,  for  a  vector  20  elements  long,  the  efficiency  in  using  the 
equipment  for  the  above  example  cannot  be  more  than  50  percent,  and  conse¬ 
quently  speed  of  the  processor  is  reduced  by  half.  The  losses  in  speed  are 
associated  with  equipment  standstill  during  the  time  of  filling  of  the 
pipeline  and  its  emptying,  in  other  words,  during  the  "runup"  and  "slowdown." 
To  be  sure,  these  times  are  functions  of  the  time  to  read  the  vector  from 
memory- -the  memory  access  time  (T^cc) •  Thus,  in  the  CRAY-1,  eight  directly- 
addressable  fast  vector  registers  (64  elements  each)  have  been  introduced 
between  the  memory  and  the  processor  in  order  to  shorten  the  time  of  runup  and 
slowdown,  which  has  greatly  increased  the  speed  of  the  complex  for  short 
vectors,  as  compared  to  the  computer  Cyber  205  [13,  14,  15,  16].  Yet  intro¬ 
ducing  such  small  number  of  directly-addressable  registers  has  limited  the 
capacity  of  the  CRAY-1  to  create  different  pipeline  configurations  adapted  to 
the  macrobased  crosses  and  made  the  programming  much  more  complicated.  There¬ 
fore,  it  is  better  for  the  processor  to  work  directly  with  local  RAM,  provided 
that  the  times  of  runup  and  slowdown  are  substantially  reduced. 

If  the  system  is  strictly  determined  in  time,  it  is  possible  to  delegate  to 
the  translator  the  task  of  allocation  of  hardware  resources  during  the  per¬ 
formance  of  the  macrobased  cross,  as  well  as  the  task  of  preparing  the  next 
configuration  of  the  processor  while  the  old  one  is  still  running.  This 
involves  a  preliminary  fetching  of  operands  from  the  local  memory  and  a  using 
of  the  actuators  as  they  become  available- -the  stage  of  slowdown  is  integrated 
in  time  with  the  stage  of  startup  of  the  new  macrobased  cross. 

It  should  be  noted  that  the  effectiveness  of  this  method  depends  largely  on 
the  quality  of  the  translator,  since  the  processor-memory  resources  are 
allocated  during  the  translation  and  before  the  program  is  executed.  The  task 
of  the  translator  is  substantially  complicated  when  the  vectors  are  small  in 
dimension,  n  <  N^j^p,  where  n  is  the  number  of  elements  of  the  vector  to  be 
executed  by  the  macrobased  cross,  and  is  the  number  of  microoperations  of 
the  pipeline  adapted  for  this  cross.  When  n  *=  1,  the  efficiency  of  this 
method  is  greatly  reduced,  since  the  pipeline  must  be  reconfigured  in  each 
clock  period  for  sufficiently  complete  utilization  of  the  pipeline. 

It  is  necessary  to  observe  yet  another  feature  of  the  operation  of  a  vector 
processor.  Suppose  that  the  pipeline  configuration  shown  in  Fig.  1  is 
operating.  Then,  in  each  clock  period  (r^jop),  five  numbers  should  be  read 
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from  memory  and  one  written.  If  —  12.5  ns  (as  in  the  CRAY-1),  the  memory 
retrieval  cycle  should  be  «2  ns. 

To  solve  rather  large  problems,  a  memory  of  tens  of  millions  of  words  is 
required.  It  is  virtually  impossible  to  construct  a  memory  of  the  required 
volume,  operating  at  a  speed  of  several  nanoseconds,  without  substantial 
forfeiture  in  the  memory  access  time  (T^cc) •  question  therefore  arises  as 

to  the  volijune  of  the  scratch-pad  memory  of  the  vector  processor,  which  should 
be  such  that  the  rate  of  exchange  with  the  next  memory  level  is  at  least  an 
order  of  magnitude  smaller  thanks  to  repeated  use  of  the  data  located  therein. 

Taking  a  scalar  processor  as  an  analogy,  this  condition  is  met  with  a 
scratch -pad  memory  or  cache  memory  volume  of  several  thousand  words . 
Accordingly,  for  a  vector  processor  with  average  vector  length  of 
100  elements,  we  may  assume  a  scratch-pad  memory  volume  equaling  several 
hundreds  of  thousands  of  words.  Thus,  a  vector  processor  must  have  a  local 
scratch-pad  memory  much  larger  than  the  memory  of  a  scalar  processor.  Of 
course,  such  different  methods  of  organization  of  the  computing  processes  for 
scalar  and  vector  arithmetic  requires  fundamentally  different  architecture  and 
circuitry  in  the  design  of  the  computer  hardware,  specifically; 

-  the  scalar  processor  should  have  a  minimum  time  of  execution  of  operations 
and  time  of  memory  access.  The  vector  processor  should  have,  first  and 
foremost,  a  maximum  speed  of  computations  and  maximum  memory  throughput 
capacity  [1]; 

-  the  hardware  of  the  scalar  processor  should  ensure  effective  operation  in 
multiprogram  mode.  The  vector  process  should  be  oriented  to  solving  large 
problems  in  monoprogram  mode; 

-  the  most  efficient  method  of  allocation  of  resources  (actuators,  registers, 
etc.)  of  a  scalar  processor,  as  shown  by  the  experience  with  the  use  of  the 
Elbrus  MCC,  is  dynamic  (as  the  programs  are  being  executed),  to  be 
accomplished  (generally)  by  hardware.  Effective  execution  of  programs  in  the 
vector  processor  requires  a  static  allocation  of  resources  during  the  transla¬ 
tion  of  the  programs ,  which  should  be  secured  by  completely  different  hardware 
strategies  in  the  design  of  the  processor. 

The  integration  of  such  conflicting  demands  in  a  single  processor  leads  to 
relatively  incompatible  circuitry  structures  of  the  processor  design.  The 
inevitable  compromise  solutions  must  reduce  the  maximum  speed  of  either  the 
vector  or  the  scalar  portion  of  the  processor.  Therefore,  the  solution  to  the 
problem  of  achieving  maximum  speed  is  best  achieved  in  a  multiprocessor 
complex  with  extensive  use  of  specialized  vector  processors,  alongside  univer¬ 
sal  processors  (oriented  to  scalar  computations) .  This  technique  of  vector - 
ization  was  selected  in  the  creation  of  the  Elbrus -2  MCC. 

A  similarly  designed  complex  is  the  supercomputer  Cyber  205,  which  in  its  full 
array  has  four  vector  processors  and  one  universal  processor.  The 
Elbrus -2  with  vector  processors  would  have  certain  advantages  over  the 
Cyber  205,  thanks  to  the  possible  creation  of  an  array  consisting  of  any 
desired  combination  of  vector  and  scalar  processors  and  the  presence  of  a  very 
high  speed  local  memory  in  each  vector  processor.  In  this  way,  the  speed  of 
the  Elbrus -2  MCC  with  vector  processor  would  be  comparable  to  that  of  the 
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Cyber  205  in  many  applications,  despite  the  fact  that  the  componentry  of  the 
latter  is  nearly  three  times  faster. 

The  architectural  features  of  a  vector  processor  are  important  when  it  is 
working  in  an  array.  Thus,  while  the  central  processors  in  the  Elbrus -2  MCC 
are  loaded  during  the  execution  of  the  computation  by  hardware  implementation 
of  the  principle  of  depersonalized  operation,  the  loading  of  the  vector 
processor  should  be  done  by  the  complex  with  program  control.  In  order  to 
secure  a  strict  synchronization  of  the  data  arriving  at  the  inputs  of  the 
actuators,  and  also  substantially  reduce  the  time  of  working  with  the  common 
memory,  the  vector  processor  should  be  coupled  to  the  complex  via  a  local 
memory  of  sufficiently  large  size. 

A  block  diagram  of  the  vector  processor  in  the  Elbrus -2  MCC  is  shown  in 
Fig.  2. 

In  designing  the  vector  processor,  it  was  extremely  important  to  determine  the 
necessary  memory  volumes  and  Q2,  in  order  to  satisfy  the  relations: 


—  E^/E2  and  K2  —  E2/E3 , 


which  obtain  in  the  solution  of  the  majority  of  problems. 

The  maximum  handling  capacity  of  the  vector  processor  E^  =  1  word/5  ns  can  be 
determined  from  the  operation  of  the  VP  under  maximum  workload  of  its 
actuators  (Fig.  3).  The  value  of  E2  should  not  exceed  the  maximum  rate  of 
exchange  of  the  central  processor  (E2  =  1  word/80  ns)  [3].  Thus,  should  be 
greater  than  16. 

If  we  grant  that  five  disks  can  work  at  the  same  time  with  a  single  vector 
processor,  then  E3  will  be  determined  by  the  total  average  exchange  rate  of 
2  fMS  per  word  and  the  value  of  K2  should  be  no  more  than  25.  The  possibility 
of  fulfilling  these  relationships  has  been  checked  by  a  model  investigation  of 
the  Navier-Stokes  problem  (see  below  and  [7]),  which  also  determined  the 
volumes  of  the  local  memory,  =  500,000  words,  and  the  working  memory 
Q2  *=  16  million  words. 

The  delegation  of  vector  arithmetic  to  a  separate  processor  also  has  negative 
aspects,  to  be  sure.  One  of  these  is  that  purely  vector  or  purely  scalar 
computations  do  not  commonly  occur,  and  thus  there  may  be  losses  in  transfer 
of  data  from  the  vector  to  the  scalar  processor  and  vice  versa.  These  losses 
are  moderated  by  the  following  factors: 

a)  the  central  processor  of  Elbrus -2  has  hardware  support  for  the  execution  of 
vector  computations,  thus  increasing  the  speed  in  the  vector  processing  cycles 
by  10-20  percent,  and  the  vector  processor  is  able  to  perform  scalar  computa¬ 
tions  using  the  very  same  actuating  devices; 

b)  the  time  to  transfer  data  from  the  vector  to  the  scalar  processor  and  back 
has  been  reduced  to  the  utmost:  the  transfer  of  10,000  words  from  the  vector 
to  the  central  processor  and  back  occupies  no  more  than  1  mis  (the  maximum 
handling  capacity  for  the  entire  working  memory  is  5  ns  [3]); 
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c)  successful  projects  are  being  accomplished  to  increase  the  degree  of 
vector ization  of  algorithms  while  at  the  same  time  localizing  them.  In  this 
case,  the  scalar  portion  of  the  computations  can  be  done  in  the  background  of 
the  vector  portion  [10,  11], 
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Fig.  2.  Block  diagram  of  the  Elbrus -2  MCC  with  vector 
processor. 


a.  Central  processor  d.  Working  memory 

b.  Vector  processor  e.  External  disk 

c.  Local  memory  of  memory 

vector  processor 

Note:  E^,  E2,  E3  -  average  handling  capacity  of 

communication  channels;  Q^,  Q2,  Q3  -  volumes  of  local, 
working,  and  external  memory. 
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Fig.  3.  Configuration  of  a  vector  processor  pipeline 
making  maximum  use  of  the  dynamic  resource  of  working 
memory  [1] . 


New  and  original  architecture  solutions  have  been  proposed  in  the  development 
of  the  vector  processor,  substantially  improving  the  effectiveness  of  utiliza¬ 
tion  of  its  actuating  devices.  These  include,  first  and  foremost,  firmware 
principles  of  reconfiguration  of  the  processor  in  order  to  adapt  it  to  execu¬ 
tion  of  different  macrobased  crosses,  juxtaposition  in  time  of  the  preparation 
of  the  next  processor  configuration  during  the  completion  of  running  of  the 
previous  configuration,  or  time  integration  of  the  stages  of  startup  and 
slowdown,  the  handling  of  the  "read  vector"  instruction  immediately  after  the 
"write"  instruction,  and  so  on. 

What  is  new  and  totally  original  is  the  high  level  language,  which  along  with 
a  natural  description  of  the  algorithms  ensures  effective  utilization  of  the 
equipment  of  the  vector  processor.  New  methods  have  been  developed  for  the 
organization  of  the  optimizing  translator.  The  project  was  carried  out  in 
view  of  the  experience  from  debugging  and  operation  of  mainframe  computer 
systems  [5,  6]. 

All  of  these  new  solutions  will  surely  be  used  extensively  in  future  Soviet 
supercomputer  projects.  A  detailed  description  of  the  most  interesting  and 
original  architectural,  circuitry  and  software  solutions  is  presented  in  the 
current  work  and  in  the  respective  articles  of  the  collection. 

2.  Architectural  Features  of  the  Vector  Processor 

Two  vector  processors- -the  CRAY-1  and  Cyber  205- -are  currently  in  greatest 
use.  It  is  therefore  convenient  to  relate  the  design  features  of  the  vector 
processor  to  these  architectural  solutions  [13,  14,  15,  16] . 

As  already  pointed  out,  one  of  the  methods  of  raising  the  speed  of  a  vector 
processor  is  the  organization  of  a  parallel  operation  of  its  functional 
arithmetic  executing  devices  by  connecting  them  into  a  processing  pipeline  so 
that  the  outputs  of  certain  devices  are  connected  directly  to  the  outputs  of 
other  devices  (see  Fig.  1).  Hereafter,  we  shall  call  this  concatenation. 

The  CRAY-1  has  three  functional  vector  devices  that  can  be  concatenated.  The 
concatenation  is  done  in  hardware  fashion  (the  programmer  may  not  even  be 
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aware  of  it) ,  and  therefore  the  intermediate  results  (despite  the  concatena¬ 
tion)  should  be  entered  into  the  vector  registers,  which  lessens  the  overall 
speed  of  the  vector  processor. 

In  the  Cyber  205,  high  performance  is  achieved  by  parallel  operation  of 
several  processing  pipelines  (as  many  as  four) ,  In  the  processing  pipeline 
itself,  two  devices  may  be  concatenated,  for  which  purpose  an  explicit  state¬ 
ment  is  required  in  the  program,  the  LINK  instruction  in  front  of  the  two 
operations  being  concatenated. 

In  our  vector  processor  (VP) ,  as  many  as  six  functional  devices  can  operate 
simultaneously : 

-  two  addition  devices; 

-  two  multiplication  devices; 

-  a  division  device; 

-  a  logic  device. 

In  the  vector  processor,  the  programming  of  a  particular  formula  is  done  not 
by  a  sequence  of  operations,  but  by  specifying  the  connections  among  the  group 
of  functional  devices  performing  these  operations.  The  inputs  of  each  of  the 
six  devices  may  be  connected  directly  to  the  output  of  any  given  device.  In 
order  to  compute,  for  example,  the  vector  expression: 

A  -f  B  X  C  (1) 

G  :  =  D  +  E  X  F 

where  A,  B,  C,  D,  E,  F,  G  are  vectors  having  n  components,  we  may  construct  a 
processing  pipeline  with  the  configuration  shown  in  Fig.  3.  The  overall  speed 
of  the  VP  in  this  case  is  5/r,  where  r  is  the  work  pace  of  each  functional 
device . 

The  vector  processor,  like  the  Cyber  205,  works  directly  "from  memory  into 
memory"  (the  operand  is  read  from  main  memory  and  the  result  is  written  into 
main  memory) ,  which  substantially  enlarges  its  capabilities  and  simplifies  the 
programming  as  compared  to  the  CRAY-1.  However,  unlike  the  Cyber  205,  each 
processor  of  the  VP  in  the  Elbrus-2  has  its  own  local  memory- -the  vector 
processor  memory- -that  is  connected  by  standard  processor-memory  channel  to 
the  common  main  memory  of  the  latter  (see  Fig.  2).  As  a  result,  the  VP  has 
six  read- from -memory  flows  and  two  write -in-memory  flows.  This  is  sufficient 
to  construct  almost  any  reasonable  configuration  from  the  available  group  of 
functional  devices. 

The  two  write  flows  make  it  possible  to  construct  two  parallel -operating 
processing  pipelines  (Fig.  4)  to  compute,  e.g.,  two  expressions: 

D:«AxB+C;  H:=ExF+C. 

The  handling  capacity  of  the  vector  processor  memory,  which  should  provide  for 
the  overall  flow  of  the  processing  pipeline,  is  achieved  by  an  interleaving  of 
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the  memory  at  the  low- order  bits  of  the  address  for  the  parallel -operating 
modules . 


Fig.  4.  Two  parallel -operating  pipelines. 


So  that  no  limits  are  placed  on  the  mutual  arrangement  of  the  vectors  used  by 
the  processing  pipeline  in  the  memory,  the  flows  are  serviced  in  alternate 
batches  of  k  ntambers ,  where  k  is  the  number  of  memory  modules .  But  this  is 
only  possible  for  a  particular  spacing  at  which  the  vector  components  are 
lodged  in  memory. 

In  the  Cyber  205,  simultaneous  accessing  of  memory  modules  by  independent 
addresses  is  not  possible,  and  therefore  the  processing  pipeline  is  linked  to 
the  memory  by  superwords  with  a  spacing  equal  to  unity.  This  is  a  major 
limitation  that  reduces  the  effectiveness  of  working  with  multidimensional 
arrays . 

In  the  VP  memory,  the  modules  are  accessed  by  independent  addresses,  thus 
substantially  moderating  the  demands  on  the  spacing  of  the  vector  (with 
respect  to  difference  in  the  addresses  of  consecutive  vector  components).  If 
the  number  of  memory  modules  equals  a  power  of  two,  the  vector  spacing  should 
be  an  odd  number.  This  guarantees  that  any  consecutive  k  components  of  the 
vector  will  be  arranged  in  different  memory  modules .  Such  organization  of  the 
interaction  with  the  main  memory  enables  an  effective  working  with  multidimen¬ 
sional  arrays  of  data  for  all  measurements.  For  example,  if  it  is  necessary 
to  work  with  the  rows  and  columns  of  a  matrix,  it  may  be  necessary  to  intro¬ 
duce  an  imaginary  column  to  make  the  row  spacing  and  the  column  spacing  odd 
numbers . 

The  VP  has  a  private  buffer  for  each  read  and  write  flow.  The  buffer  of  the 
input  flow  from  memory  receives  k  numbers  at  the  same  time,  and  one  ntamber  is 
sent  out  to  the  input  of  a  particular  functional  device  in  each  clock  period. 
On  the  other  hand,  the  buffer  of  the  outgoing  flow  accumulates  one  word  from 
each  output  of  a  functional  device  and  sends  out  k  numbers  for  writing  into 
memory. 

One  of  the  major  deficiencies  of  Cyber  205  is  the  poor  effectiveness  of 
working  with  short  vectors.  Thus,  when  the  length  of  the  vectors  n  is  equal 
to  25,  Cyber  205  puts  forth  20  percent  of  its  maximtun  performance;  if  the 
length  is  100,  it  puts  forth  50  and  if  the  length  is  1000,  it  puts  forth 
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around  90  percent.  Such  as3rmptotic  relationship  is  the  result  of  the  great 
inertia  of  the  pipeline  and  the  losses  during  startup  and  slowdown. 

In  the  vector  processor,  special  attention  has  been  given  to  minimizing  the 
performance  losses  associated  with  clearing  the  pipeline  devices  upon  comple¬ 
tion  of  work  in  one  configuration  (slowdown  phase)  and  filling  the  pipeline  of 
the  other  configuration  (startup  phase)  by  introduction  of  firmware  to  inte¬ 
grate  the  startup  and  slowdown  phases,  thus  increasing  the  efficiency  of  the 
VP  when  working  with  short  vectors. 

Let  us  consider  the  functional  layout  of  the  vector  processor  (Fig.  5)  in 
order  to  explain  the  implementation  of  the  new  principles  of  processing  of 
vector  information.  The  processing  pipeline  includes: 

-  six  functional  arithmetic  pipeline  devices; 

-  six  incoming  flow  buffers; 

-  two  outgoing  flow  buffers; 

-  four  real  register  files  designed  to  hold  constance  and  scalar  variables. 

When  the  processing  pipeline  is  assembled,  the  links  between  the  above  devices 
are  established  by  the  configurator,  which  makes  it  possible  to  connect  each 
input  of  the  functional  devices,  the  register  files,  or  the  outgoing  flow 
buffers  to  the  output  of  any  given  functional  device,  register  file,  or 
incoming  flow  buffer. 

The  configurator  is  controlled  by  the  configuration  control  unit  in  accordance 
with  the  flow  of  configuration- change  instructions  arriving  from  the  instruc¬ 
tion  buffer.  The  other  instructions  are  sent  from  this  buffer  to  the  index 
processor,  which  is  a  universal  integer  processor  designed  to  organize  the 
computational  process  and  the  preliminary  processing  of  the  address  instruc¬ 
tions  and  exchange  instructions.  This  has  a  32-bit  arithmetic -logic  unit  and 
its  own  internal  memory  of  IK  word  capacity,  which  may  be  transcribed  into  the 
VP  memory  through  the  write  buffer  and  unloaded  from  the  VP  memory  through  the 
read  buffer.  In  addition,  the  index  processor  has  access  to  the  real  register 
files . 

The  exchange  device  organizes  an  exchange  between  the  VP  memory  via  the  input 
and  output  buffers  and  the  main  memory  of  the  MCC,  to  which  the  exchange 
device  is  connected  across  a  standard  interface  in  place  of  one  of  the  central 
processors  of  the  MCC. 

The  VP  memory  with  a  cycle  of  40  ns  has  an  interleaving  of  eight  independently 
addressable  modules.  Requests  for  the  VP  memory  are  handled  in  keeping  with 
their  priority.  The  highest  priority  belongs  to  the  write  flows,  after  which 
come  the  instruction  flow,  the  read  flows,  the  swap  flows  of  the  index 
processor  memory,  and  the  exchange  flow. 

The  address  unit  is  designed  to  generate  the  physical  addresses  for  access  to 
the  memory  modules  of  the  VP  from  all  users,  employing  the  current  starting 
address  and  the  spacing  of  the  flow  being  handled  at  the  given  moment  to 
generate  the  eight  addresses. 
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Fig.  5.  Functional  diagram  of  the  vector  processor. 
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The  vector  processor  uses  the  following  representation  of  numbers,  address 
system,  and  instruction  system. 

1.  REPRESENTATION  OF  NUMBERS.  There  are  two  formats  for  representation  of 
numbers  in  the  VP  instruction  system: 

-  64-bit  format  of  numbers  with  floating  decimal; 

-  32-bit  format  of  integers. 
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All  functional  pipeline  devices  work  only  with  64-bit  numbers. 

2.  ADDRESS  SYSTEM.  The  vector  processor  has  two  kinds  of  memory:  main  vector 
memory  and  register  files.  Both  the  index  processor  and  the  processing 
pipeline  may  access  the  register  files. 

The  information  in  the  vector  memory  is  arranged  in  arrays.  Programming  of 
all  access  to  the  vector  memory  is  done  in  mathematical  addresses.  The 
physical  addresses  of  the  instructions  and  data  are  generated  by  adding  the 
mathematical  address  to  the  value  of  the  particular  base  register. 

3.  INSTRUCTION  TYPES,  a)  The  instructions  of  the  index  processor  include: 

-  integer  arithmetic  operation; 

-  logic  operation; 

-  conditional  and  unconditional  jump  instructions; 

-  transfer  instructions  addressed  to  the  register  files; 

-  the  instruction  for  loading  the  vector  length  counter; 

-  special  instructions  for  hardware  support  of  the  programming  system. 

b)  Address  instructions  enable  addressing  of  scalars,  one -dimensional  and 
two-dimensional  arrays.  When  addressing  a  one -dimensional  array,  the  initial 
address,  interval  and  quantity  are  specified;  when  addressing  a  two- 
dimensional  array,  the  initial  address,  interval  for  the  first  measurement, 
quantity  for  the  first  measurement,  interval  for  the  second  measurement  and 
quantity  for  the  second  measurement  are  specified. 

c)  The  exchange  instruction,  like  the  address  instruction,  specifies  an 
exchange  array  in  the  vector  memory  and,  accordingly,  an  exchange  array  in  the 
RAM  of  the  MCC,  as  well  as  the  exchange  direction  ( input- output ) . 

d)  Reconfiguration  instructions  describe  the  process  of  reconfiguration  of  the 
processing  pipeline  over  time. 

As  an  example,  let  us  consider  the  program  for  evaluation  of  the  expression: 

D  :  -  (A  +  B)  X  C, 

where  A,  B,  C,  D  are  vectors  of  length  n.  The  program  consists  of  three 
parts . 

The  first  part  contains  the  address  instructions  for  computing  the  sequence  of 
addresses  of  the  components  of  each  vector  and  loading  the  vector  length 
counter . 

The  second  and  third  parts  consist  of  reconfiguration  instructions:  the 
second  part  describes  the  startup  phase- -the  consecutive  process  of  building 
the  configuration  for  evaluation  of  the  given  expression,  the  third  describes 
the  slowdown  phase- -the  consecutive  process  of  clearing  the  devices  upon 
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completion  of  computation  of  the  expression.  Hereafter,  such  program  for 
evaluation  of  a  vector  expression  containing  the  aforesaid  three  parts  will  be 
termed  a  macrobased  cross.  The  processing  pipeline  for  evaluation  of  the 
macrobased  cross  with  the  specified  formula  is  shown  in  Fig.  6. 


Fig.  6.  Pipeline  for  evaluation  of  a 
macrobased  cross. 


Say  the  length  of  the  pipeline  of  the  addition  device  is  equal  to  k  clock 
periods,  while  the  length  of  the  pipeline  of  the  multiplication  device  is 
equal  to  m  clock  periods.  The  startup  phase  of  the  macrobased  cross  may  be 
described  as  follows . 

1.  Connect  inputs  of  the  adder  to  the  incoming  flow  buffers  1  and  2,  enable 
flows  1  and  2. 

2 .  Delay  for  k  clock  periods . 

3.  Connect  first  input  of  multiplier  to  output  of  adder,  second  input  of 
multiplier  to  incoming  flow  buffer  3,  enable  flow  3. 

4.  Delay  for  m  clock  periods. 

5.  Connect  output  of  multiplier  to  outgoing  flow  buffer  7,  enable  flow  7. 

After  n  clock  periods  from  the  onset  of  the  startup  phase,  the  slowdoxm  phase 
of  the  macrobased  cross  should  be  initiated,  which  can  be  described  as 
follows . 

1.  Disable  the  exit  of  incoming  flows  1  and  2  from  the  buffers,  clear  the 
input  of  the  adder. 

2.  Delay  for  k  clock  periods. 

3.  Disable  exit  of  incoming  flow  3  from  buffer,  clear  input  of  multiplier. 

4.  Delay  for  m  clock  periods. 

5.  Clear  input  of  outgoing  flow  buffer  7. 

The  reconfiguration  instructions  include: 

a)  instructions  for  partial  reconfiguration  of  the  startup  phase,  which 
describe  the  dynamics  of  change  in  configuration  of  the  startup  phase  over 
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time.  The  instruction  indicates  the  connections  that  should  be  established 
between  the  devices  and  how  many  clock  periods  the  configuration  should  remain 
unchanged; 

b)  the  instruction  for  the  time  diagram  of  the  startup  phase  of  the  macrobased 
cross,  which  indicates  (for  each  device  involved  in  the  processing  pipeline) 
the  delay  in  clock  periods  from  the  onset  of  the  startup  phase  till  the  time 
when  the  device  is  required  in  the  configuration; 

c)  the  instruction  of  the  time  diagram  of  the  slowdown  phase  of  the  macrobased 
cross,  which  indicates  (for  each  device  involved  in  the  processing  pipeline) 
the  delay  in  clock  periods  from  the  onset  of  the  slowdown  phase  till  the 
clearing  of  the  input  of  each  device. 

The  partial  reconfiguration  instruction  is  local  in  action,  i.e.,  it  changes 
only  those  connections  in  the  processing  pipeline  that  are  explicitly  stated 
(everywhere  else  the  configuration  remains  as  before).  The  quality  of 
localness  makes  it  possible  to  combine  independent  startup  and  slowdown 
phases .  The  startup  phase  of  a  macrobased  cross  allows  a  gradual  building  of 
the  configuration- -from  the  incoming  flow  buffers  to  the  required  functional 
devices  at  the  exact  time  when  they  come  into  actual  use.  This  makes  it 
possible  to  begin  the  building  of  the  next  configuration  without  waiting  till 
the  end  of  the  computation  in  the  previous  configuration,  and  to  make  use  of 
the  devices  in  the  next  clock  period  after  the  clearing  of  their  input  by  the 
previous  configuration. 

The  instructions  of  the  time  diagrams  of  the  startup  and  slowdown  phases  have 
been  specially  introduced  to  achieve  a  hardware  integration  of  two  contiguous 
macrobased  crosses .  The  slowdown  phase  of  the  cross  begins  in  n  clock  periods 
after  the  onset  of  the  startup  phase,  where  n  is  the  length  of  the  vector. 

It  is  natural  to  require  that  the  reconfiguration  program  of  the  macrobased 
cross  be  independent  of  the  length  of  the  vector.  However,  for  short  vectors, 
it  is  possible  to  combine  the  startup  phase  and  the  slowdown  phase  of  the 
cross.  This  situation  is  diagrammed  in  Fig.  7a.  In  the  interval  A,  it  is 
necessary  to  execute  two  flows  of  reconfiguration  instructions  at  the  same 
time . 

In  order  to  integrate  in  time  the  slowdown  phase  of  the  current  macrobased 
cross  with  the  startup  phase  of  the  following  macrobased  cross,  it  is  no 
longer  sufficient  to  execute  only  two  flows  of  reconfiguration  instructions  at 
the  same  time,  since  the  situation  shown  in  Fig.  7b  is  possible  in  short 
vectors.  In  the  interval  A,  three  flows  of  reconfiguration  instructions 
should  be  executed  at  the  same  time. 

The  vector  processor  realizes  three  simultaneous  flows  of  reconfiguration 
instructions.  The  starting  of  the  next  macrobased  cross  may  begin  after 
completion  of  the  startup  phase  of  the  current  cross,  i.e.,  the  length  of 
vectors  in  which  it  is  possible  to  work  without  loss  is  equal  to  the  length  of 
the  processing  pipeline  (on  the  order  of  20-30). 

As  for  any  other  pipeline  working  "from  memory  into  memory,"  the  problem  of 
read  after  write  (RAW)  arises  in  the  VP,  the  nature  of  which  is  as  follows. 

Say  a  macrobased  cross  is  using  the  result  of  a  previous  cross.  For  proper 
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functioning,  the  execution  of  the  second  macrobased  cross  must  not  begin  until 
the  pipeline  is  emptied  and  all  results  of  the  first  cross  are  written  into 
memory,  i.e.,  it  is  not  permitted  to  combine  the  startup  and  slowdown  of  two 
such  macrobased  crosses. 
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Fig.  7.  Time  diagrams  of  operation  of  a  vector 
processor  in  the  startup  and  slowdown  stage:  a  -  one 
macrobased  cross;  b  -  two  consecutive  macrobased 
crosses. 
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In  the  VP,  the  RAW  problem  is  solved  in  firmware  fashion.  For  each  array,  the 
address  instruction  contains  information  about  the  range  of  variation  of 
addresses  (R) .  The  programming  system  guarantees  that  the  addresses  of  all 
elements  of  the  array  are  contained  in  the  segment  [AB,  AB  +  R] ,  where  AB  is 
the  address  of  the  beginning  of  the  array.  If  a  segment  of  addresses  of  the 
read  array  of  the  next  macrobased  cross  intersects  a  segment  of  addresses  of 
the  write  array  of  a  previous  cross,  the  reading  of  elements  of  the  particular 
cross  is  postponed  in  hardware  manner  until  all  results  of  the  previous  cross 
are  written  into  memory ,  In  other  situations ,  combination  of  crosses  is 
permitted.  Comparison  of  the  segments  is  done  by  the  hardware,  prior  to 
fetching  the  operands  in  memory. 

The  architectural  solutions  adopted  ensure  maximum  speed  of  the  processor  in 
the  componentry  of  the  Elbrus -2  MCC  (r  =  40  ns)  equaling  125  megaflops  per 
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second.^  The  actual  speed  of  the  processor  was  evaluated  in  a  model  investi¬ 
gation  of  the  solving  of  the  Navier-Stokes  equations  in  a  three-dimensional 
domain.  This  same  problem  was  also  used  to  clarify  the  required  volumes  of 
vector  and  working  memory  and  Q2)  and  the  required  handling  capacity  of 
the  communication  channels  (Ej^,  E2,  E3)  (see  Fig.  2). 

3.  Results  of  Model  Investigation  of  the  Solving  of  the  Navier-Stokes 
Equations  in  a  Three-Dimensional  Domain  in  the  VP 

The  choice  of  the  problem  of  solving  the  Navier-Stokes  equations  in  a  three- 
dimensional  domain  as  a  test  case  for  the  vector  processor  was  dictated  by  the 
following  main  factors : 

First,  the  problem  of  solving  three-dimensional  Navier-Stokes  equations  with 
an  increased  number  of  nodes,  beyond  the  power  of  the  available  computers, 
does  exist  and  is  urgent.  Second,  algorithms  have  been  developed  today  for 
parallel  solving  of  three-dimensional  Navier-Stokes  equations  that  are 
suitable  for  implementation  in  a  vector  processor  [8,  9].  Third,  the  Navier- 
Stokes  problem  bears  features  of  a  typical  problem  with  regulator  grid,  which 
places  methodological  value  on  this  programming  test. 

The  programming  of  the  Navier-Stokes  problem  was  done  in  the  underlying 
programming  language  and  pursued  the  following  goals. 

1)  to  assess  the  possibility  of  pipeline  parallel  processing  of  the  problem; 

2)  to  assess  the  volume  of  the  vector  memory  needed  for  effective  implementa¬ 
tion  of  the  algorithm; 

3)  to  estimate  the  time  to  solve  the  problem; 

4)  to  evaluate  the  effectiveness  of  utilization  of  the  processor  equipment  in 
the  solving  of  the  problem. 

It  should  be  pointed  out  that  solving  such  problem  on  a  grid  of 
10  X  10  X  15  nodes  in  the  BESM-6  takes  about  one  hour,  and  increasing  the 
number  of  nodes  is  not  permitted  by  the  low  capacity  of  the  working  storage. 
Solving  the  problem  in  the  BESM-6  on  a  grid  of  50  x  50  x  50  nodes  is  almost 
impossible,  yet  it  is  just  such  problems  that  are  becoming  very  urgent.  We 
now  examine  this  problem.  The  solution  algorithm  is  based  on  the  method  of 
partitioning  of  spatial  variables  and  physical  processes  and  employs  an 
algorithm  developed  at  the  ITPM  SO  AN  SSSR  [8,  9]. 

In  the  algorithm,  by  a  special  transformation  of  coordinates,  the  physical 
domain  with  irregular  grid  is  turned  into  a  cube  with  side  N  with  uniform 
grid,  in  which  the  calculation  is  performed.  Each  time  step  of  the  solution 
is  broken  up  into  six  consecutive  minor  steps.  Six  of  the  minor  steps  include 


=  40  ns  is  the  nominal  clock  period  of  the  vector  processor.  The 
operating  clock  period  of  the  Elbrus -2  MCC  is  r  *=  44  ns.  In  the  model 
investigation  of  the  working  of  the  vector  processor  with  Navier-Stokes 
equations,  it  was  assumed  that  r  «  44  ns. 
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two  steps  apiece  in  each  of  the  three  orthogonal  directions.  The  equations  of 
each  minor  step  include  the  first  and  second  partial  derivatives  of  unknowns 
with  respect  to  only  one  spatial  variable ,  Thus ,  the  performance  of  one  minor 
step  results  in  an  independent  solution  for  N  x  N  one -dimensional  boundary 
value  problems. 

The  solution  employs  an  implicit  difference  scheme  with  seven-point  spatial 
pattern,  each  minor  step  comes  down  to  solving  between  one  and  five  systems  of 
linear  equations  with  three -diagonal  matrix: 


+  Bj  Yj  +  CfYi.i,  a  F|  ,  1,  a  ■ 

Am  Vm-4  +  Vm  a  Fm-  . 

The  coefficients  and  right-hand  sides  of  these  equations  may  be  computed 
independently  of  each  other  and  in  parallel  throughout  the  computational 
domain.  Calculation  of  the  coefficients  and  right-hand  sides  involves 
evaluation  of  the  first  and  second  derivatives  of  unknown  variables.  The 
three -diagonal  systems  of  equations  are  solved  by  the  method  of  a  scalar  run 
through  in  each  minor  step,  and  scalar  passes  can  be  accomplished  indepen¬ 
dently  of  each  other. 

The  structure  of  the  processor  imposes  certain  limits  on  the  possible  parallel 
execution  of  the  algorithm.  The  limited  volume  of  the  vector  memory  cannot 
hold  all  problem  data  at  the  same  time,  and  therefore  the  data  are  kept  in  a 
second- level  memory,  and  only  information  pertaining  to  a  certain  layer  of  the 
calculation  region  is  called  into  the  vector  memory.  The  vector  processor 
carries  out  the  calculation  by  layers.  In  Fig.  8,  the  cube  represents  the 
calculation  domain,  while  the  planes  depicted  therein  represent  data  kept  in 
the  vector  memory. 


Fig.  8.  Representation  of  data  kept  in  the 
vector  memory. 

Say  the  first  minor  step  involves  derivatives  of  unknowns  with  respect  to  r , 
the  second  with  respect  to  rj ,  the  third  with  respect  to  f,  the  fourth  again 
with  respect  to  the  fifth  with  respect  to  r;,  and  the  sixth  with  respect  to 
f.  Then,  using  data  of  the  k-1  and  k+1  planes,  the  coefficients  and 
right-hand  sides  of  the  three -diagonal  systems  of  the  first  minor  step  are 
evaluated  for  the  k-th  plane.  These  computations  require  three  contiguous 
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parallel  planes,  since  the  partial  derivatives  of  first  and  second  order  are 
evaluated  not  only  from  ^  and  ,  but  also  from  the  normal ,  i .  e .  ,  along  f . 

In  order  to  evaluate  the  coefficients  and  right-hand  sides  of  the  equations  of 
the  first  minor  step,  a  pipeline  processing  is  done  for  the  matrix  of  the  k-th 
plane.  The  three -diagonal  system  is  solved  by  the  economical  method  of  a 
scalar  run  through.  In  the  k-th  plane,  50  scalar  passes  along  ^  are  required. 

An  individual  scalar  pass  is  poorly  suited  to  parallel  pipeline  processing, 
since  both  the  forward  pass: 


and  the  backward  pass: 
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use  recurrent  formulas  for  the  calculation. 

However,  the  simultaneous  execution  of  many  (15)  independent  passes  allows  the 
use  of  parallel  pipeline  processing. 

Thus,  a  "vector”  of  scalar  passes  along  ^  (first  forward,  then  backward)  is 
accomplished.  With  this,  the  calculation  of  the  first  minor  step  is  finished. 

The  calculation  of  the  second  minor  step  is  done  along  in  the  very  same  k-th 
plane.  As  in  the  first  minor  step,  the  coefficients  are  evaluated,  and  then  a 
pass  is  made  along  17. 

By  adding  information  from  the  following  k  +  2  plane  to  the  vector  memory ,  we 
may  perform  the  calculation  of  the  first  and  second  minor  steps  for  the  next 
k  +  1  plane.  Moving  from  one  plane  to  another,  the  calculation  of  the  first 
and  second  minor  steps  is  performed  for  the  entire  calculation  domain. 

A  different  technique  of  parallel  processing  is  used  for  the  third  minor  step 
along  f  (the  normal  to  ^17). 

In  moving  from  the  k-th  to  the  k  +  1-th  plane  during  calculation  of  the  first 
and  second  minor  steps ,  the  coefficients  are  evaluated  for  all  50  x  50  systems 
of  equations  of  the  third  minor  step  and  the  k-th  recurrent  term  is  evaluated 
for  each  of  the  50  x  50  forward  passes  along  f .  All  these  computations  for 
the  k-th  plane  are  done  in  pipeline  mode.  Thus,  after  completion  of  the  first 
and  second  minor  steps  in  the  entire  calculation  domain,  it  turns  out  that  a 
"matrix”  of  straight  passes  along  f  has  been  accomplished  at  the  same  time. 
Moving  through  the  planes  in  the  backward  direction,  it  is  possible  to 
accomplish  a  backward  pass  and  complete  the  third  minor  step  in  the  entire 
domain . 

In  the  programming,  the  completion  of  the  third  minor  step  in  the  k-th  plane 
is  combined  with  the  performance  of  the  fourth  and  fifth  minor  steps  in  this 
plane.  Furthermore,  while  moving  from  plane  to  plane  during  performance  of 
the  fourth  and  fifth  minor  steps  the  coefficients  of  the  equations  of  the 
sixth  minor  step  are  evaluated  and  the  k-th  recurrent  term  of  all  50  x  50 
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forward  passes  of  the  sixth  minor  step  is  evaluated.  The  final  pass  along  f 
completes  the  sixth  minor  step  and  the  entire  time  step  as  a  whole. 

More  careful  examination  of  the  computation  process  reveals  that  the  vector 
memory  must  simultaneously  hold  the  data  for  not  three,  but  four  planes  of  the 
calculation  domain.  Furthermore,  a  buffer  exchange  zone  should  be  set  aside 
in  the  vector  memory  for  the  data  of  one  plane. 

The  exchange  of  data  with  the  Elbrus  MCC  should  be  done  in  arrays  or  matrices 
corresponding  to  the  next  plane  of  the  calculation  space.  Each  node  of  the 
grid  during  the  process  of  the  computation  is  described  by  28  parameters. 

These  include  the  density  p,  temperature  T,  components  of  the  velocity  u,  v, 
w,  increments  of  these  quantities  to  be  evaluated  in  the  given  time  step, 

length  of  the  radius -vector  r,  the  thermal  conductance  A, 
the  viscosity  /x,  the  coefficients  of  transformation  of  the  coordinates  z^,  Z2, 
Z3,  721  73  through  coefficients  020 » 

Pj.  Thus,  the  data  of  the  problem  present  a  four -dimensional  array  of 
50  X  50  X  50  X  28. 

During  the  computation,  four  three-dimensional  arrays  corresponding  to  four 
planes,  i.e.,  a  dimension  of  50  x  50  x  28,  and  several  working  arrays  of 
dimension  50  x  50  are  present  in  the  vector  memory.  The  problem  is  programmed 
in  the  underlying  high-level  language.  The  features  of  the  programming 
language  make  it  possible  to  access  the  required  subarrays  in  the  vector 
memory.  The  program  has  a  modular  structure  and  contains  three  basic  modules. 

The  first  module  involves  calculation  in  the  current  plane  of  the  first  and 
second  minor  steps,  as  well  as  the  coefficients  and  evaluation  of  the  next 
recurrent  term  of  the  third  minor  step  in  the  entire  plane.  The  first  module 
also  contains  exchange  instructions  which  send  the  results  of  the  computation 
for  the  previous  plane  to  memory  and  prepare  the  data  for  the  following  plane . 
The  module  has  a  rather  large  amount  of  computation  and  not  many  exchanges. 

The  exchange  is  entirely  integrated  with  the  computation. 

The  second  module  completes  the  third  minor  step,  includes  the  calculation  of 
the  fourth  and  fifth  minor  steps ,  and  also  contains  a  calculation  of  the  next 
recurrent  term  of  the  sixth  minor  step  for  the  entire  current  plane.  During 
its  operation,  the  module  performs  an  exchange  with  the  memory  of  Elbrus -2. 

The  module  has  a  rather  large  ntunber  of  exchanges  and  the  computation  is  not 
fully  integrated  with  the  exchange. 

The  third  module  completes  the  sixth  minor  step  and  the  entire  time  step  as  a 
whole .  There  are  few  computations ,  and  the  operating  time  of  the  module  is 
determined  by  the  exchange. 

The  program  of  the  three  modules  numbers  819  vector  assignment  operators, 
corresponding  to  approximately  2700  ordinary  arithmetic  scalar  floating 
decimal  operations  per  node  in  one  time  step.  The  volume  of  computations  of 
the  entire  problem  amounts  to  10^^  operations. 

The  program,  the  electronic  componentry,  and  the  thoroughly  elaborated  struc¬ 
tural  scheme  of  the  vector  processor  enable  certain  quantitative  assessments 
of  its  performance. 
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In  the  first  place,  we  may  estimate  the  time  to  perform  any  given  stage  of  the 
computations:  the  time  to  perform  one  vector  assignment  operator  is  equal  to 

the  product  of  the  duration  of  the  clock  period  of  the  pipeline  and  the  number 
of  elements  of  the  array  (in  our  problem,  the  number  of  elements  is 
50  X  50  —  2500,  and  the  startup  time  can  be  ignored).  The  time  for  an 
exchange  is  estimated  by  the  ratio  of  the  volume  of  information  transmitted 
and  the  handling  capacity  of  the  exchange  channel.  Information  on  the  modules 
(for  a  simultaneous  step)  of  the  program  solving  the  Navier-Stokes  equations 
is  presented  in  Table  1  for  a  clock  period  of  duration  44  ns  and  a  rate  of 
exchange  of  four  million  words  per  second.  Thus,  the  problem  including 
300  time  steps  will  be  executed  in  27  min. 
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Secondly,  we  may  obtain  an  estimate  of  the  minimum  volume  of  the  vector  memory 
needed  to  implement  the  described  program.  The  vector  memory  should  hold  at 
least  the  data  of  five  planes  of  the  calculation  space,  i.e., 

5  X  50  X  50  X  28  =  350  10^  words.  Such  volume  allows  a  run  through  in  two 
orthogonal  directions  of  the  plane  without  supplementary  accessing  of  the 
external  memory.  If  a  sufficiently  fast  vector  memory  is  used,  the  speed  of 
the  process  in  this  case  is  determined  by  the  pace  of  operation  of  the 
processing  devices.  With  a  vector  memory  of  lesser  volume,  the  speed  of  the 
pipeline  begins  to  depend  explicitly  on  the  rate  of  exchange  (£3)  and  declines 
sharply.  In  other  words,  the  vector  memory  should  make  possible  a  reduction 
of  the  pace  of  accessing  of  the  outer  memory,  i.e.,  the  memory  of 
Elbrus -2  MCC,  while  holding  the  data  for  the  entire  problem,  in  the  amount  of 
3.6  million  words,  in  the  working  storage  of  the  Elbrus -2  MCC.  A  comparative 
evaluation  of  the  speed  of  the  BESM-6  and  the  vector  processor  in  solving  the 
Navier-Stokes  equations  is  illustrated  in  Table  2. 
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Table  2 


Parameter  Compared 


BESM-6 


Vector  Processor 


Dimensions  of  grid  10  x  10  x  15 

Volxime  of  problem  data  (in  3  x  10^ 

numbers) 


Problem  solving  time  (in  1 

hours) 

Relative  speed  (compared  to  1 

BESM-6) 


50  X  50  X  50 
3.5  X  10® 


0.5 

200 


The  program  developed  to  solve  the  Navier- Stokes  equations  made  it  possible  to 
evaluate  the  efficiency  of  utilization  of  the  vector  processor  equipment.  In 
particular,  it  was  found  that  3.34  of  the  five  functional  floating- decimal 
devices  (two  adders,  two  multipliers,  and  one  divider)  work  on  average, 
putting  forth  an  average  speed  of  approximately  80  MFLOPS .  A  histogram  of  the 
relative  frequency  of  appearance  of  configurations  with  different  nixmbers  of 
functional  devices  is  shown  in  Fig.  9. 


Fig.  9.  Histogram  of  relative  frequency  of 
appearance  of  configurations  (in  percent) 
with  differing  ntunber  of  functional 
devices . 

Key: 

a.  Niamber  of  FD  in  configuration 

It  is  interesting  to  observe  the  relatively  high  workload  of  the  division 
device:  it  occupies  38.4  percent  of  the  computation  time.  This  is  explained 

by  the  fact  that,  even  though  division  comprises  about  10  percent  of  the 
arithmetic  operations  in  the  present  problem  (as  in  many  scientific  and 
engineering  problems) ,  the  simultaneous  operation  of  the  two  adders  and  two 
multipliers  almost  quadruples  the  frequency  of  access  to  the  divider. 
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The  obtained  results  reveal  an  excellent  and  rather  well-balanced  workload  of 
the  functional  devices,  testifying  to  the  rational  choice  of  the  outfit  of 
functional  devices . 
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GENERAL 


MASSIV,  THE  UNDERLYING  PROGRAMMING  LANGUAGE  OF  THE  VECTOR  PROCESSOR 
907G0089B  Moscow  SUPER-EVM  in  Russian,  pp.  30-48 
[Article  by  V.  D.  Asriyeli  and  T.  A.  Popov] 

[Text]  The  vector  processor  (VP)  of  the  Elbrus  multiprocessor  computing 
complex  (MCC)  is  designed  for  effective  solving  of  a  broad  class  of  scientific 
and  engineering  problems  [1],  An  underlying  high-level  programming  language 
was  specially  developed  for  the  VP,  enabling  maximum  effectiveness  in  use  of 
the  VP  [2] .  The  present  work  contains  a  complete  description  of  the  realized 
version  of  the  underlying  language  of  the  VP  of  the  Elbrus  MCC. 

The  underlying  programming  language  of  the  VP  is  oriented  to  working  with 
arrays.  The  underlying  language  can  describe  arrays  with  dimensionality  not 
greater  than  four,  although  only  one -dimensional  or  two-dimensional  arrays  may 
take  part  in  the  operations .  The  underlying  language  allows  four  types  of 
variables: 

-  integer  (32-bit  signed  integer); 

-  scalar  (64-bit  real-valued  number) ; 

-  array  (consisting  of  64 -bit  real -valued  numbers) ; 

-  external  array  (located  in  the  second-level  memory). 

The  underlying  language  is  based  on  a  consecutive  operator  scheme  of  programs 
with  parallel  assignment  operator  of  array  type  (AOAT) .  The  semantics  of  the 
AOAT  distinguishes  this  scheme  from  that  underlying  the  majority  of  presentday 
programming  languages  (ALGOL,  FORTRAN,  etc.).  The  AOAT  is  defined  as  a 
compact  notation  of  a  set  of  elementary  assignment  operators  (EAO) ,  define®/on 
the  analogous  components  of  the  arrays,  and  the  use  of  the  AOAT  is  defined^^s 
an  asynchronous  parallel  execution  of  the  constituent  EAOs.  The  AOAT  is 
considered  to  be  executed  when  all  its  EAOs  have  been  executed. 

Descriptions 

The  unit  of  programming  (and  translation)  of  the  underlying  language  is  the 
module.  Of  the  modules  comprising  the  program  package,  the  module  which 
begins  the  running  of  the  program  is  singled  out.  This  head  module  (program) 
differs  from  an  ordinary  module  in  the  fact  that  it  contains  an  external 
memory  description  partition  instead  of  a  parameter  partition.  The  external 
memory  partition  is  used  to  map  the  external  arrays  on  the  second- level  memory 
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(RAM  of  the  Elbrus  MCC) .  The  initial  contents  of  the  second-level  memory  are 
formed  by  external  means  (relative  to  the  VP),  i.e.,  the  resources  of  the 
Elbrus  MCC. 

A  module  in  the  underlying  language  consists  of  descriptions  and  operators. 

The  descriptions  are  divided  into  five  partitions: 

-  the  partition  of  formal  parameters ; 

-  the  partition  of  local  variables; 

-  the  partition  of  subarrays; 

-  the  partition  of  invoked  modules; 

-  the  partition  of  macrodefinitions . 

The  formal  parameter  partition  is  designed  to  describe  the  parameters  of  the 
module.  The  information  contained  in  this  partition  is  used  in  particular  by 
the  compiler  in  compiling  the  object  modules.  A  module  may  have  formal 
parameters  of  all  four  types,  which  receive  the  values  of  the  actual 
parameters  when  the  module  is  called  up;  the  parameters  for  variables  of 
integer  and  scalar  type  are  given  by  a  value,  those  for  variables  of  array  and 
external  array  type  are  given  by  an  access  (i.e.,  a  formal  array-parameter 
gives  the  module  access  to  an  actual  array-parameter) . 

Let  us  consider  the  descriptions  of  the  arrays  more  closely.  The  underlying 
language  defines  two  similar  types  of  variables:  the  array  and  the  external 
array.  These  differ  only  in  the  fact  that: 

-  the  array  is  situated  in  the  RAM,  while  the  external  array  is  in  the  second- 
level  memory  (RAM  of  the  MCC) ; 

-  the  module  may  describe  a  local  variable  of  array  type ,  but  not  a  local 
variable  of  external  array  type; 

-  only  variables  of  array  type  may  participate  in  the  operations,  while  the 
variables  of  external  array  type  may  occur  only  in  exchange  operators. 

In  each  measurement,  the  elements  of  an  array  (and  an  external  array)  are 
numbered  with  nonnegative  whole  ntimbers,  starting  at  zero.  Thus,  if  the 
quantity  of  elements  in  a  certain  measurement  is  k,  the  elements  in  this 
measurement  are  numbered  from  0  to  k-1. 

The  local  variable  partition  describes  variables  which  are  created  when  the 
module  is  called  up  and  cancelled  upon  return  from  the  module.  All 
descriptors  of  measurements  in  the  description  of  a  local  array  are  of  the 
kind  <DESCRIPTIVE  EXPRESSION>,  where 

<DESCRIPTIVE  EXPRESSION>  ::  -=  <NAME  OF  ARRAY> . <NUMBER  OF  MEASUREMENT> | 

<D.NUMBER>| 

<DESCRIPTIVE  EXPRESSIONX+-XDESCRIPTIVE  EXPRESSION> 

<NUMBER  OF  MEASUREMENT>  : :  =  1 | 2 | 3 | 4 

<+->  : :  =  +1  - 
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The  value  of  the  descriptive  expression  determines  the  number  of  elements  in 
the  given  measurement  of  the  described  array.  If  the  descriptive  expression 
is  <NAME  OF  AREAY> . <NUMBER  OF  MEASUREMENT> ,  the  corresponding  quantities  of 
elements  are  linked  by  an  equivalence  relation  (i.e. ,  they  occupy  the  same 
registers  of  the  register  file).  Thus,  e.g.,  the  description  (Fig.  1): 

/M2/  D[B.l,  B.2-2] 

defines  a  two-dimensional  array,  the  nximber  of  rows  in  which  coincides  with 
the  quantity  of  elements  for  the  first  measurement  of  the  array  B,  while  the 
quantity  of  colvunns  is  2  less  than  the  quantity  of  columns  of  array  B.  The 
quantity  of  elements  in  the  first  measurement  of  array  D  and  the  quantity  of 
elements  in  the  first  measurement  of  array  B  are  linked  by  an  equivalence 
relation. 


%MOflyjib{nPMMEP)  (1) 
XnAPAMETPU  (  p  ) 

/M2/  A,B,C 

%JIOKAJIbHUE  (  3  ) 

/U/  N 

/IG/  d[b.1.B.2-2J 

mi  E^.l.C.O 

/MI/  F&-0  (4) 

XnOflMACCMBbr  ■  ^  ' 

/MI/  g[i3  »  b[»,5] 
/UMi/  h[x3(m)  =  bi;‘,n1 


XMOflyJIH  (5) 

/MOfl/  nPMMEPKB.D.H)  ' 


XOREPATOPH 

(7) 

nPMMEPKC.D.C) 

nPMMEPI(C,D,F) 

(8) 

SKOHEU 

(9) 
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/Mode/Example  1 

2. 

Parameters 

7. 

Operators 

3. 

Local 

8. 

Example  1 

4. 

5. 

Subarrays 

Modules 

9. 

End 

The  subarray  partition  describes  subarrays  of  previously  described  arrays 
(reference  arrays).  There  are  two  common  types  of  subarray:  static  and 
dynamic.  Only  a  formal  array,  a  local  array,  or  a  static  subarray  can  be  a 
reference  array. 

A  static  subarray  is  defined  once  when  the  module  is  called  up  and  cannot  be 
redefined  during  the  execution  of  the  module.  The  description  of  a  d3mamic 
subarray  uses  an  integer- type  associated  variable,  and  the  dynamic  subarray  is 
redefined  with  each  change  in  value  of  the  associated  variable  (Fig.  1) . 
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In  the  left  part  of  the  description  of  a  static  subarray,  in  place  of  the 
determinants  of  measurements  there  appear  the  nximbers  of  measurements  of  the 
reference  array,  providing  the  orientation  for  the  measurements  being 
described.  The  measurement  nximbers  will  be  different  in  different 
measurements  of  the  described  array.  In  the  right  part  of  the  description 
appears  the  name  of  the  reference  array,  while  the  square  brackets  contain 
clarifying  information  on  each  measurement  of  the  reference  array,  which  along 
with  the  left  part  of  the  description  uniquely  determines  the  array  being 
described.  The  clarifying  information  is  of  the  type: 

-  either  *,  when  the  measurement  of  the  described  array  referring  to  a  given 
measurement  of  the  reference  array  coincides  with  the  given  measurement; 

-  or  <START>:<QUANTITY>,  when  the  measurement  of  the  described  array  referring 
to  a  given  measurement  of  the  reference  array  is  a  contraction  of  the  given 
measurement.  If  the  quantity  is  of  the  type  <NAME  OF  ARRAY>.<NUMBER  OF 
MEASUREMENT> ,  then  the  determined  quantity  of  the  described  array  is  linked  by 
an  equivalence  relation  to  the  corresponding  quantity  of  the  previously 
described  array; 

-  or  <DESCR1PTIVE  EXPRESSION>,  when  there  is  no  reference  to  a  given  measure¬ 
ment  of  a  reference  array  in  the  array  being  described,  i.e.,  the  described 
array  has  lesser  dimensionality. 

<START>  : :  =  <DESCRIPTIVE  EXPRESSION> 

<QUANTITY>  : :  =  <DESCRIPTIVE  EXPRESS ION> 

For  example,  the  description 

/M2/  A[3,  2]  -  B[5,  *,  l:B.3-2] 

describes  a  two-dimensional  subarray  A  of  a  three-dimensional  array  B.  In 
array  B,  the  fifth  slice  is  taken  for  the  first  measurement.  In  the  resulting 
two-dimensional  array,  the  first  and  last  columns  are  discarded  and  the 
remaining  array  is  transposed.  The  array  resulting  from  all  of  these 
manipulations  is  the  subarray  A.  Of  course,  no  displacement,  discarding,  or 
transposition  of  the  elements  of  array  B  takes  place.  Array  A  simply  obtains 
access  to  those  elements  of  array  B  that  would  remain  if  the  above  transforma¬ 
tions  were  performed.  A  further  result  of  such  description  of  array  A  is  that 
the  quantity  of  elements  for  the  second  measurement  of  array  A  is  linked  by  an 
equivalence  relation  with  the  quantity  of  elements  for  the  second  measurement 
of  array  B. 

On  the  left  side  of  a  description  of  a  dynamic  subarray,  in  place  of  the 
measurement  descriptors  appear  the  numbers  of  the  measurements  of  the 
reference  array,  providing  an  orientation  for  the  measurements  being 
described.  The  numbers  of  the  measurements  will  be  different  in  different 
measurements  of  the  described  array.  Next,  in  round  brackets,  comes  the  name 
of  the  associated  integer  variable,  the  equal  sign,  the  name  of  the  reference 
array  and,  in  square  brackets,  the  determining  information.  The  dimension¬ 
ality  of  a  dynamic  subarray  is  one  unit  less  than  that  of  the  reference  array, 
which  is  the  reason  for  the  existence  of  the  number  of  the  measurement  of  the 
reference  array,  which  is  not  found  on  the  left  side  of  the  description  of  the 
dynamic  subarray.  In  place  of  this  measurement  of  the  reference  array  appears 
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the  name  of  the  associated  integer  variable  (the  same  one) ,  while  in  place  of 
the  other  measurements  of  the  reference  array  appears  the  symbol  *. 

Example: 

/iDG/  a[}.0  (HOMEP)  = 

Key: 

a.  Number 

The  partition  of  invoked  modules  describes  the  structure  of  formal  parameters 
of  the  modules  being  called  up.  The  information  of  this  partition  is  used  as 
a  compiler  in  compiling  the  object  modules.  The  structure  of  the  actual 
parameters  in  the  module  call  operator  should  agree  with  the  structure  of  the 
formal  parameters  specified  in  the  invoked  module  partition. 

Operators 

Five  basic  kinds  of  operator  exist  in  the  underlying  language: 

-  the  assignment  operator; 

-  the  module  call  operator; 

-  the  return  from  module  operator; 

-  the  branch  operator; 

-  the  exchange  operator. 

In  accordance  with  the  types  of  variables  appearing  in  the  assignment 
operator,  we  distinguish  three  types  of  assignment  operator: 

-  the  integer  type  assignment  operator; 

-  the  scalar  type  assignment  operator; 

-  the  array  type  assignment  operator. 

Furthermore,  the  language  has  a  construct  which  combines  two  array  type 
assignment  operators,  enabling  simultaneous  execution  of  both  operators.  This 
construct  is  known  as  the  DOUBLET.  For  example,  simultaneous  execution  of  two 
assignment  operators  of  array  type: 

X:  -  Y+Z 
Q:  «  (Q+R+S)/T, 

where  X,Y,  Z,  Q,  R,  S,  T  are  one -dimensional  or  two-dimensional  arrays,  is 
specified  by  a  single  DOUBLET  construct: 

DOUBLET  (X:  -  Y+Z,  Q:  *  (Q+R+S)/T) 

The  DOUBLET  construct  is  defined  as  a  compact  notation  of  a  set  of  EAOs, 
obtained  by  unifying  the  sets  of  EAOs  of  two  AOATs .  The  execution  of  this 
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construct  is  defined  as  an  asynchronous  parallel  execution  of  the  EAOs  of  a 
unified  set. 

The  right  side  of  the  assignment  operator  is  an  expression  of  appropriate 
type.  Different  groupings  of  two-place  operations  are  generally  involved  in 
the  formation  of  expressions  of  different  types.  The  grouping  of  operations 
of  integer  type  includes: 

+  -  the  operation  of  integer  addition; 

-  -  the  operation  of  integer  subtraction; 

*  -  the  operation  of  integer  multiplication; 

&  -  the  operation  of  bitwise  logical  multiplication; 

%  -  the  operation  of  bitwise  logical  addition; 

#  -  the  operation  of  bitwise  logical  comparison; 

O  -  the  operation  of  bitwise  logical  shifting. 

In  the  shift  operation,  the  number  being  shifted  is  the  first  operand,  while 
the  magnitude  and  direction  of  the  shift  are  the  second  operand  (negative 
value  means  shift  to  left) . 

The  grouping  of  scalar  type  and  array  type  operations  includes: 

addition,  subtraction,  multiplication; 

/  -  division; 

X  -  cutoff  maximum  (AXB  means  MAX  (A-B,0)); 

6c,  %,  #  -  bitwise  logic  operations; 

O  -  shift. 

Furthermore,  three  functions  of  a  single  argument  may  be  involved  in  forming  a 
scalar  type  expression:  SQUARE,  DELTA,  and  INT.PART  [KVARDAT,  DELTA,  and 
TsChAST] .  The  result  of  the  function  DELTA  is  a  real-valued  unit,  if  the 
arguunent  is  less  than  zero,  and  a  real -valued  zero  otherwise.  The  result  of 
the  function  INT.PART  is  the  closest  integer  to  the  argument,  the  modulus  of 
which  does  not  exceed  the  modulus  of  the  argument,  and  the  resulting  integer 
is  represented  as  a  normalized  real -valued  nrunber.  The  function  SQUARE 
enables  a  highly  effective  (without  using  intermediate  variables)  computation 
of  the  square  of  a  value. 

Three  functions  of  a  single  argument  may  be  involved  in  forming  array  type 
expressions:  DELTA,  INT.PART,  and  ARR  [DELTA,  TsChAST,  and  MAS].  The 

argument  and  result  of  the  DELTA  (INT.PART)  function  is  an  array,  the  result 
being  a  componentwise  application  of  the  scalar  function  DELTA  (INT.PART)  to 
the  elements  of  the  argument -array.  The  argument  of  the  function  ARR  is  a 
scalar,  and  the  result  is  an  array,  the  structure  of  which  depends  on  the 
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context  (since  the  operations  in  the  expression  are  performed  on  arrays  with 
identical  structure) . 

All  the  operations  and  functions  involved  in  formation  of  array  type  expres¬ 
sions  are  divided  into  groups: 

-  multiplication  (*) ; 

-  addition  (+,  X,  INT.PART); 

-  division  (/) ; 

-  logic  (DELTA,  &,%,#,<>); 

-  scalar  (ARR) . 

The  underlying  language  places  certain  restrictions  on  the  form  of  the  array 
type  assignment  operator  and  the  DOUBLET  construct.  In  one  assignment 
operator  (or  two  operators  of  the  DOUBLET  construct)  there  should  be: 

“  not  more  than  two  operations  of  the  multiplication  group; 

-  not  more  than  two  operations  of  the  addition  group; 

-  not  more  than  one  operation  of  the  division  group; 

-  not  more  than  one  operation  of  the  logic  group; 

-  not  more  than  four  operations  of  the  scalar  group; 

-  not  more  than  six  entries  of  names  of  arrays  in  the  right-hand  part  of  the 
operator  (right-hand  sides  of  the  DOUBLET  construct) . 

By  operation  here  is  meant  both  a  two -place  operation  and  a  function  of  one 
argument . 

Expressions  are  evaluated  in  accordance  with  the  priorities.  The  highest 
priority  goes  to  functions  of  one  argument  and  expressions  enclosed  in 
brackets  (priority  0) .  Two-place  operators  have  the  following  priorities: 

&  -  priority  1; 

%  -  priority  2; 

#  -  priority  3; 

O  -  priority  4; 

*  -  priority  5; 

/  -  priority  5; 

+  -  priority  6; 
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-  -  priority  6; 

X  -  priority  6. 

The  lowest  priority  (7)  is  of  the  group  of  relation  operations: 

«=  -  equals; 

/=  -  not  equal; 

<  -  less  than; 

<=  -  less  than  or  equal  to; 

>  -  greater  than; 

>=  -  greater  than  or  equal  to. 

Relation  operations  may  occur  only  in  a  conditional  jump  operator  and  an 
exchange  operator. 

All  operations  on  arrays  are  performed  componentwise,  i.e,,  an  analogous 
scalar  operation  is  applied  to  the  like  components  of  the  arrays.  In  an  array 
type  assignment  type  operator  (and  DOUBLET  construct) ,  all  arrays  should  have 
identical  dimensionality  and  identical  quantities  of  elements  in  the  like 
measurements . 

The  module  call  operator  contains  factual  parameters,  and  the  structure  of  the 
factual  parameters  should  agree  with  that  of  the  formal  parameters  of  the 
called  module.  A  recursive  module  call  is  permitted.  The  return  from  module 
operator  returns  control  to  the  calling  module. 

The  branch  operators  are  divided  into  conditional  and  unconditional.  The 
conditional  branch  operator  passes  control  to  a  label  within  the  given  module 
if  the  condition  is  true  and  passes  control  to  the  next  operator  if  the 
condition  is  false.  There  are  two  types  of  condition:  integer  and  scalar. 

The  exchange  operator  has  two  modifications  (input  and  output) .  The  input 
operator  assigns  to  an  array  type  variable  the  value  of  an  external  array  type 
variable.  The  output  operator  assigns  to  an  external  array  type  variable  the 
value  of  an  array  type  variable.  If  the  condition  field  in  the  exchange 
operator  is  not  empty,  the  operator  is  performed  when  the  condition  is  true. 

If  the  condition  is  false,  the  operator  is  not  performed  (it  is  passed  over). 
An  exchange  may  specify  recoding  into  the  format  of  the  Elbrus  CP  (E)  or  the 
Elbrus  SVS  (B) . 

Exchange  operators  are  performed  in  strict  sequence,  i.e.,  the  next  exchange 
operator  will  be  performed  only  after  total  completion  of  the  exchange  of  the 
previous  exchange  operator.  An  exchange  operator  initiates  the  current 
exchange  and,  without  waiting  for  total  completion  of  the  current  exchange, 
passes  control  to  the  next  operator.  The  operator  SYNCHRO  is  used  to 
synchronize  the  exchange  and  the  computations,  guaranteeing  that  the  next 
operator  will  be  initiated  only  after  total  completion  of  the  current 
exchange . 
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Additional  Capabilities 

The  operator  SPECl  makes  it  possible  to  form  a  64-bit  scalar  value  from  32 -bit 
integer  values.  The  first  integer  expression  forms  the  high-order  bits  of  the 
scalar,  the  second  the  low-order  bits.  The  operator  SPEC2  makes  it  possible 
to  slice  a  64-bit  scalar  value  into  two  32 -bit  elements  and  to  store  them  in 
integer  variables.  The  first  integer  variable  is  assigned  the  value  of  the 
high-order  bits  of  the  scalar,  the  second  the  value  of  the  low-order  bits. 

The  integer  function  of  two  argiaments  INV  performs  a  double  inversion  of  the 
value  of  the  first  argtiment  and  then  shifts  the  result  to  the  right  by  the 
number  of  bits  indicated  in  the  second  argument. 

For  array  type  variables,  along  with  the  DOUBLET  construct  there  is  yet 
another  construct  (FORK) ,  which  in  certain  cases  can  boost  the  efficiency  of 
the  program.  This  construct  is  subject  to  the  same  quantitative  and  qualita¬ 
tive  restrictions  on  the  makeup  of  the  operations  as  the  DOUBLET  construct  and 
the  array  t3rpe  assignment  operator.  The  efficiency  is  enhanced  by  using 
common  subexpressions  within  a  single  FORK  construct.  The  FORK  construct 
contains  one  or  two  array  type  assignment  operators.  In  the  case  of  two 
operators  (as  in  the  DOUBLET  construct) ,  both  operators  will  be  performed  at 
the  same  time.  However,  unlike  the  DOUBLET  construct,  both  operators  may 
contain  common  subexpressions  that  are  evaluated  once.  The  array  type 
expression  is  specially  amplified  for  the  FORK  construct.  Thus,  notations  of 
type: 


[<DIGITXA.EXPRESSION>]  and  [<DIGIT>] 

are  array  type  expressions,  the  first  notation  defining  a  common  subexpression 
and  assigning  to  this  subexpression  a  number,  while  the  second  notation  is  a 
reference  to  the  defined  subexpression.  For  example,  two  assignment  operators 
of  array  type: 


F  :  »  A/((B+C)*(C-D)), 

G  :  =  (B+C)*(C-D)*E 

can  be  replaced  with  a  single  FORK  construct: 

FORK  (F  :  -  A/[1(B+C)*(C-D) ] ,  G  :  -  [1]*E) 
or 

FORK  (F  :  =  A/[l(B-f[2C]  )*(  [2] -D)  ]  ,  G  :  «  [1]*E), 

the  latter  construct  being  preferable,  since  (other  conditions  being  equal) 
there  are  fewer  names  of  arrays  on  the  right  side  of  the  assignments.  The 
first  FORK  construct  corresponds  to  the  processing  pipeline  configuration 
shown  in  Fig.  2,  the  second  FORK  construct  corresponds  to  the  processing 
pipeline  configuration  in  Fig.  3. 

The  digits  in  the  square  brackets  of  the  FORK  construct  project  the  points  of 
branching.  The  second  FORK  construct  has  an  important  attribute:  two 
different  paths,  issuing  from  the  same  branching  point  (ntunber  2),  converge  on 
the  same  functional  device  (multiplier),  forming  a  so-called  contour.  All 
contours  of  the  processing  pipeline  should  be  balanced,  i.e.,  both  paths 
forming  a  contour  should  have  the  same  length.  The  path  length  is  equal  to 
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the  sum  of  delays  of  the  functional  devices  through  which  the  path  passes.  A 
delay  line,  which  may  have  a  length  of  8,  16,  24,  32  (respectively  1,  2,  3,  4 
in  the  underlying  language) ,  is  used  to  equalize  the  path  length.  In  the 
language,  a  delay  line  corresponds  to  a  function  of  two  arguments  DELAY  (3). 
The  first  argument  of  the  delay- function  is  the  magnitude  of  the  delay,  while 
the  second  argument  is  an  array  type  expression,  the  values  of  which  are 
"delayed."  The  del ay -function  is  part  of  the  group  of  logic  operations. 

Since  all  functional  devices  (except  the  divider)  have  a  delay  of  8  (the 
divider  has  16),  two  array  type  assignment  operators: 

G  :  =  A/B*(B+C), 

H  :  =  (D+E)*F 

are  equivalent  to  the  FORK  construct: 

FORK  (G  :  =  A/[1B]*3(1, [1]+C) ,H:  =  (D+E)*F), 


it  being  impossible  in  this  case  to  forego  a  contour  (i.e.,  by  using  a  DOUBLET 
construct) ,  since  more  than  six  names  of  arrays  would  then  appear  on  the  right 
sides.  This  FORK  construct  corresponds  to  the  processing  pipeline  configura¬ 
tion  shown  in  Fig.  4.  The  semantics  of  the  FORK  construct  is  analogous  to  the 
semantics  of  the  array  type  assignment  operator  (AOAT)  and  the  DOUBLET 
construct. 


Fig.  2 


Fig.  3 
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Fig.  4 


The  CONVOLUTION  operator  makes  it  possible  to  convolute  the  array  result  of 
evaluation  of  a  given  expression  by  a  single  operation  (+,  *,  %,  #,  &) , 
writing  the  result  as  a  scalar  variable.  For  example,  the  operator: 

CONVOLUTION  (C:  =  +[A*B]) 

will  write  the  value  of  the  scalar  product  of  vectors  A  and  B  into  the  scalar 
variable  C. 

Not  more  than  one  operation  of  the  logic  group  may  appear  on  the  right  side  of 
the  operator  CONVOLUTION. 

Syntax 

In  describing  the  syntax,  an  effort  is  made  to  portray  the  possible  arrange¬ 
ment  of  the  described  constructs  in  the  lines  of  a  program.  Therefore,  if  a 
certain  alternative  occupies  several  lines ,  this  means  that  the  concepts  of 
which  it  is  formed  should  begin  with  a  new  line.  An  obvious  exception  to  this 
rule  is  when  the  text  of  the  alternative  does  not  fit  on  one  line. 

An  identifier  is  a  sequence  of  letters  or  digits,  beginning  with  a  letter. 

The  length  of  an  identifier  is  not  more  than  eight  characters. 

<PROGRAM>  : :  =  %PROGRAM(<IDENTIFIER>) 

<PARTITION  EX.  MEMORY> 

<DESCRIPTIONS> 

<PARTITION  OF  OPERATORS> 

%END 

<MODULE>  : :  =  %MODULE(<IDENTIFIER>) 

<PARTITION  OF  PARAMETERS> 

<DESCRIPTIONS> 
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<PARTITION  OF  OPERATORS> 

%END 

<DESCRIPTIONS>  : :  =  <PARTITION  OF  LOCALS> 

<PARTITION  OF  SUBARRAYS> 

<PARTITION  OF  MODULES> 

<PARTITION  OF  MACRODEFINITIONS> 

<PARTITION  OF  PARAMETERS>  : :  =  %PARAMETERS 

<FORM . PARAMETERS> | 

<FORM.PARAMETERS>  ::  =  <FORM . PARAMETER> | 

<FORM . PARAMETER> . <FORM . PARAMETERS> | 

<FORM . PARAMETERS> 

<FORM . PARAMETERS> 

<FORM.PARAMETER>  ::  =  <INT.DESCRIPTION>|<AR.DESCRIPTION>| 

<S . DESCRIPTION> 

<INT.DESCRIPTION>  ::  ■=  /INT/  <LIST  OF  IDENTIFIERS> 

<LIST  OF  IDENTIFIERS>  ::  =  <IDENTIFIER>| 

<IDENTIFIER>,<LIST  OF  IDENTIFIERS> 

<AR.DESCRIPTION>  ::  =  <TYPE  OF  ARRAYXLIST  OF  IDENTIFIERS> 

<TYPE  OF  ARRAY>  =  /AR<DIMENSION>/|/EX.AR<DIMENSION>/ 

<DIMENSION>  ::=l|2|3|4 

<S . DESCRIPTION>  =  /S/<LIST  OF  IDENTIFIERS> 

<PARTITION  OF  LOCALS>  : :  =  %LOCAL 

<LOCAL  VARIABLES> 

<LOCAL  VARIABLES>  : ;  =  /INT/<INT . C . DESCRIPTION> 

/S/<S . C , DESCRIPTION> | 

<INT . DESCRIPTION> | <S . DESCRIPTION> | 
<L.AR.DESCRIPTION>| 

<LOCAL  VARIABLES>,<LOCAL  VARIABLES>| 

<LOCAL  VARIABLES> 

<LOCAL  VARIABLES> 

<INT . C . DESCRIPTION>  ::  =  <IDENTIFIER>  =  <INT . CONSTANT> | 

<IDENTIFIEEl>  =  <INT.CONSTANT>,<INT.C.DESCRIPTION> 
<INT . GONSTANT>  ::  =  <D.NUMBER>| '<H.NUMBER>' j -<D.NUMBER> 

<D.NUMBER>  ::  =  <DIGIT> | <DIGITXD . NUMBER> 

<H.NUMBER>  =  <H.DIGIT>|<H.DIGITXH.NUMBER> 

<DIGIT>  ::=0|l|2|3|4|5|6|7|8|9 
<H.DIGIT>  ::  =  <DIGIT>  a|b|g|d|e|F 
<S.G.DESGRIPTION>  ::  =  <IDENTIFIER>  =  <S . GONSTANT> | 

<IDENTIFIER>  =  <S .GONSTANT>,<S .G .DESGRIPTION> 
<S.GONSTANT>  ::  =  <D . NUMBER> . <D . NUMBERXEXPONENT> | 

-  <D .  NUMBER> .  <D .  NUMBERXEXPONENT>  | 

'<H.NUMBER>' 

<EXPONENT>  ::  =  I  e<+-XD.NUMBER> 

<L.AR.DESGRIPTION>  ::  =  /AR<DIMENSION>/<LIST  LOG.AR.> 

<LIST  LOG.AR.>  ::  =  <LOG. ARRAY. | 

<L0G.ARRAY>,<LIST  LOG.AR.> 

<LOG.ARRAY>  -  see  "descriptions"  section 
<PARTITION  OF  SUBARRAYS>  : :  =  %SUBARRAYS 

<DESG.SUBARRAYS>| 

<DESG . SUBARRAYS>  ::  =  <SUBARRAYS>| 

<SUBARRAYS> , <DESG . SUBARRAYS> | 

<DESG.SUBARRAYS> 

<DESG.SUBARRAYS> 

<SUBARRAYS>  ::  =  <TYPE  OF  SUBARRAYXLIST  OF  DYN.SUBAR.>| 


52 


<TYPE  OF  SUBARRAYXLIST  OF  STAT.SUBAR.> 
<LIST  OF  STAT.SUBAR.>  ::  -  <STAT . SUBARRAY> | 

<STAT . SUBARRAY> , <LIST  STAT . SUBAR . > 
<LIST  DYN. SUBAR. >  ::  =  <DYN . SUBARRAY> | 

<DYN.SUBARRAY>,<LIST  DYN. SUBAR. > 
<PARTITION  OF  MODULES>  : :  =  %MODULES 

<DESCRIPTIONS  OF  M0DULES>1 
<DYN . SUBARRAY>  -  see  "descriptions"  section. 

<STAT . SUBARRAY>  -  see  "descriptions"  section. 

<DESCRIPTIONS  OF  MODULES>  ::  -=  <DESCRIPTION  OF  MODULE>| 

<DESCRIPTION  OF  MODULE> 
<DESCRIPTION  OF  MODULES> 
<DESCRIPTION  OF  MODULE>  : ;  =  /MOD/<IDENTIFIER>  (<PARAMETERS>) 
<TYPE  OF  SUBARRAY>  ::  =  /D.AR.<DIMENSI0N>/1 

/D.EX.AR.<DIMENSION>/|<TYPE  OF  ARRAY> 
<PARAMETERS>  =  <PARAMETERS> | <PARAMETER> , <PARAMETERS> 
<PARAMETER>  ::  =  <NAME  OF  INTEGER> | <NAME  OF  SCALAR>| 

<NAME  OF  INT.CONSTANT>l<NAME  OF  S.CONSTANT>| 
<NAME  OF  ARRAY>|<NAME  OF  EXTERNAL  ARRAY> 
<PARTITION  OF  OPERATORS>  : :  ■=  %OPERATORS 

<OPERATORS> 

<OPERATORS>  : :  =  <LABELED  OPERATOR> 

<OPERATORS>| 

<LABELED  OPERATOR>  : :  =  <LABEL>  :  <OPERATOR> | <OPERATOR> 


<LABEI>  : :  =  <IDENTIFIER> 

<OPERATOR>  ; :  =  <AS  S IGNMENT> | <CALI>  | <RETURN> ] 

<BRANCH> 1 <EXCHANGE> | <DOUBLET> \ <SYNCHRO> | 

<FORK> I <GONVOLUTION> | <SPEC . OPERATOR> | 

<MACROCALLS> 

<ASSIGNMENT>  ::  =  <INT.ASSIGNMENTS>1 
<S.ASSIGNMENTS>1 
<AR.ASSIGNMENT> 

<INT.ASSIGNMENTS>  ::  =  <INT.ASSIGNMENT>,<INT. ASSIGNMENTS>| 

<INT.ASSIGNMENT> 

<INT.ASSIGNMENT>  ::  =  <NAME  OF  INTEGER>:=<INT. EXPRESS ION> 

<CONVOLUTION>  ::  =  CONVOLUTION  (<NAME  OF  SCALAR>;=<OPER.>[<CONV. EXP .>] ) 
<OPER.>  ;:=+j*l%|#|& 

<CONV.EXP.>  ::  =  <NAME  OF  ARRAY>| 

<NAME  OF  ARRAYXAR.OPERATIONXNAME  OF  ARRAY> 

<INT. EXPRESS ION>  ::  -  <NAME  OF  INTEGER> | <NAME  OF  INT . CONSTANT> | 

<INT.CONSTANT>| (<INT.EXPRESSION>) | 

<NAME  OF  ARRAY>.<NUMBER  OF  MEASUREMENT> | 

<INT .  EXPRESS  lONXiNT .  OPERATIONXINT .  EXPRESSION>  \ 
INV (<INT . EXPRESSION> , <D . NUMBER>) 


<NUMBER  OF  MEASUREMENT> 
<INT.OPERATION> 
<S.ASSIGNMENTS> 

<S.ASSIGNMENT> 


<INDEXES> 


=  12  3  4 

=+|-l*&%  #|o 

=  <S.ASSIGNMENT> 

<S . ASSIGNMENT>,<S . ASSIGNMENTS> 

=  <NAME  OF  SCALAR>  :=<S . EXPRESS ION>| 

<VARIABLE  WITH  INDEXES>  :=<S . EXPRESSION> 
<VARIABLE  WITH  INDEXES>  : :  -  <NAME  OF  ARRAY>[<INDEXES>] 

<INT .  EXPRESSIONXINDEXES>  | 

<INT.EXPRESSION> 

::  =  <NAME  OF  SCALAR>|<NAME  OF  S.CONSTANT>| 


<S.EXPRESSION> 
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<VARIABLE  WITH  INDEXES>| 

(<S . EXPRESSION>) I <S . FUNCTION> | 

<S  .  EXPRESSIONXS  .  OPERATIONXS  .  EXPRESSION> 
<s.OPERATlON>  ::=+!  -  |*|/|x|&|%|#|o 
<S.FUNCTION>  ::  =  DELTA (<S. EXPRESS ION>) I 

INT. PART(<S . EXPRESS ION>) | 

SQUARE (<S . EXPRESS ION>) 

<AR.ASSIGNMENT>  ::  -  <NAME  OF  ARRAY>  :=<AR. EXPRESS ION> 

<AR. EXPRESS ION>  ::  =  <NAME  OF  ARRAY>|<AR.FUNCTION>| 

(<AR. EXPRESS ION>) | 

<AR .  EXPRESSIONXAR .  OPERATIONXAR .  EXPRESSION>  I 
[<DIGITXAR. EXPRESS ION>]  |  [<DIGIT>] 

<AR.FUNCTION>  ::  •=  DELTA(<AR. EXPRESSION>) | 

INT. PART (<AR.EXPRESSION>) | 

AR. (<NAME  OF  SCALAR>) | 

DELAY (<D . NUMBER> , CAR . EXPRES  S ION>) 

<CALL>  ::  =  CNAME  OF  MODULE>  (<F. PARAMETERS>) 

<F . PARAMETERS>  ::  =  <F . PARAMETER> , <F . PARAMETERS> | 

<F.PARAMETER> 

<F.PARAMETER>  ::  =  CNAME  OF  ARRAY>|ciNT. EXPRESS I ON>| 

CNAME  OF  SCALAR> 

CAR.OPERATION>  ::=+|  -  |*|/|x|&|%|#|c> 

CRETURN>  : :  =  RETURN 

CGO  TO>  : :  =  CONTROL(CCONDITION>)  CLABEL> 

CCONDITION>  ::  =  CINTEGER  CONDITION>|cSCALAR  CONDITION>| 

CINTEGER  CONDITION>  ;;  =  ClNT.EXPRESSIONXRELATIONXlNT.EXPRESSION> 

CSCALAR  CONDITION>  : :  =  CNAME  OF  SCALAR>CRELATION>  0 
CRELATION>  : :  =  C  I  >  I  -  I  >=  I  C=  I  /= 

CEXCHANGE>  ::  =  INPUT (CUO>) (CRECODE>) (CNAME  OF  ARRAY>, CNAME  OF  EXTERNAL 
ARRAY>) 

COUTPUT>(CUO>) (CRECODE>) (CNAME  OF  ARRAY>, CNAME  OF  EXTERNAL  ARRAY>) 
CDOUBLET>  : :  ■=  DOUBLET(CAR. ASSIGNMENT>,CAR. ASSIGNMENT>) 

CUO>  :  :  ■=  CCONDITION> 

CRECODE>  : :  =  E  I  B  I 

CSYNCHRO>  : :  =  SYNCHRO 

CFORK>  ::  =  FORK(CAR.ASSIGNMENT>) | 

FORK(CAR . ASSIGNMENT> , CAR . ASSIGNMENT>) 

CSPEC . OPERATOR>  ::  =  SPEC1(CNAME  OF  SCALAR>:=<INT. EXPRES SION>, 
ClNT.EXPRESSION>) 

SPEC2 (CNAME  OF  INTEGER> , CNAME  OF  INTEGER> : =CNAME  OF  SCALAR>) 
CPARTITION  OF  EX.MEMORY>  : :  =  %MEMORY 

CEX.ARRAYS>| 

CEX.ARRAYS>  ::  =  CDESCRIPTION  EX.AR.>| 

CDESCRIPTION  EX . AR . > , CEX . ARRAYS> | 

CEX.ARRAYS> 

CEX.ARRAYS> 

CDESCRIPTION  EX.AR,>  ::  =  /EX. AR.CDIMENSION>/CLIST  LOC.AR.> 

CPARTITION  OF  MACRODEFINITIONS>  : :  =  %MACRODEFINITIONS 

CDEF.MACROBASED  CROSSES>| 

CDEF.MACROBASED  CROSSES>  ::  ==  CDEF.MACROBASED  CROSS>. 

% 

CDEF.MACROBASED  CROSSES>| CDEF.MACROBASED  CROSS> 
CDEF.MACROBASED  CROSS>  : :  =  MACROBASED  CROSS  CNAME  OF  MACROBASED  CROSS> 

CTEXT  OF  MACRODEFINITION> 
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<NAME  OF  MACROBASED  CROSS>  : ;  =  <IDENTIFIER> 

<MACROCALL>  : :  =  %<ARRAY/MACHINE  CODE>  MACROBASED  CROSS  <PRINT  MODE> 
<NAME  OF  MACROBASED  CROSS>(<F.PARAMETERS>) 

<PRINT  MODE>  ; :  =  P  I 
CARRAY/MACHINE  CODE>  : :  =  j  A 


The  parameters  and  local  labels  appearing  in  the  text  of  the  macrodefinition 
are  signified  respectively  by:  %<NUMBER  OF  PARAMETER>  and  ; <D . NUMBER> .  The 
parameters  are  numbered  from  left  to  right  (starting  at  one) .  The  macrocall 
specifies  the  language  in  which  the  macrodefinition  is  written  and  indicates 
the  necessity  of  printing  out  the  text  of  the  macrosubstitution. 

The  order  of  description  of  local  variables  and  the  order  of  variables  in  the 
list  of  parameters  of  a  module  are  as  follows: 

1)  scalars; 

2)  integers: 

3)  arrays. 

An  empty  condition  in  the  branch  and  exchange  operators  indicates  uncondi¬ 
tional  execution  of  the  given  operators. 

The  powers  of  expression  of  the  underlying  language  were  tested  in  the 
programming  of  the  solution  of  the  Navier-Stokes  equations  [3] .  Translators 
of  the  underlying  language  have  been  created  in  the  BESM-6  and  the  Elbrus  MCC. 
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GENERAL 


EXPERIENCE  IN  CREATING  THE  SOFTWARE  OF  THE  VECTOR  PROCESSOR 

907G0089C  Moscow  SUPER- EVM  in  Russian,  pp.  49-66 

[Article  by  V.  D.  Asriyeli,  A.  V.  Myasnikov,  and  T.  A.  Popov] 

[ Text ]  Introduction 

The  creation  of  the  software  of  the  vector  processor  (VP)  is  part  of  a  many 
year  project,  the  goal  of  which  was  to  develop  a  VP  for  the  Elbrus -2  multi¬ 
processor  computing  complex  (MCC) .  The  development  of  the  VP  for  the  Elbrus-2 
MCC  culminated  in  the  creation  of  design  documentation  for  factory  manufac¬ 
ture,  creation  of  a  set  of  utilities  for  development  of  programs  for  the  VP  in 
the  BESM-6,  and  implementation  of  the  language  components  of  the  programming 
system  in  the  Elbrus -2  MCC.  The  course  of  development  of  the  VP,  as  well  as 
several  intermediate  results,  are  described  in  part  by  the  available  publica¬ 
tions  [1-6] . 

The  special  nature  of  the  VP  software  comes  from  the  fact  that  the  VP  is  not  a 
stand-alone  computing  system,  but  a  processor  that  is  incorporated  in  place  of 
one  of  the  central  processors  in  the  Elbrus-2  MCC.  Creation  of  a  stand-alone 
system  with  new  architecture  requires  development  of  a  full  volume  of  software 
and  confronts  the  designers  with  the  problem  of  transferring  the  existing 
application  programs.  Resolution  of  this  problem  largely  determines  the 
competitiveness  of  the  new  system.  From  this  perspective,  the  MCC  with  VP  has 
an  indubitable  advantage,  since  it  does  not  require  implementation  of  the  full 
volume  of  software  for  the  VP  and  enables  a  smooth  translation  to  the  VP  of 
those  tasks  that  are  not  handled  efficiently  in  the  MCC  without  VP  (e.g.,  too 
much  time  is  required  for  their  handling).  With  such  approach,  there  is  no 
need  to  transfer  the  other  tasks  to  the  VP. 

For  the  purpose  of  program  transfer,  stand-alone  vector  systems  implement 
translators  of  the  commonly  used  programming  languages  with  capabilities  of 
automatic  vectorization.  However,  this  same  utility  is  employed  to  create  new 
programs  for  the  stand-alone  vector  systems.  Such  approach  is  not  the  only 
one  possible  for  the  VP  within  the  MCC.  Under  certain  conditions,  it  is 
economically  more  advantageous  for  the  MCC  with  VP  to  create  new  programs  in 
languages  that  have  been  developed  to  allow  for  the  specific  features  of 
vector  processing.  Therefore,  the  main  attention  in  creating  the  software  for 
the  VP  was  concentrated  on  the  programming  system  and  the  language  resources 
for  achievement  of  maximum  effective  utilization  of  the  VP.  The  VP  is 
oriented  to  solving  computational  problems,  and  all  functions  involving 
organization  of  computations  in  the  MCC  with  VP  have  been  delegated  to  the  OS 
of  the  Elbrus  MCC.  The  possibility  of  realizing  a  full-valued  complex  is 
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guaranteed  by  the  presence  of  the  system  program  SUPERVISOR,  operating  in  the 
VP  and  interacting  with  the  OS  of  the  HOC  by  a  simple  protocol  using  mail 
boxes,  semaphores,  and  program  interrupts. 

The  work  on  the  software  of  the  VP  was  carried  out  along  the  following  lines. 

1.  Investigations  in  the  area  of  program  layouts  in  order  to  select  a  layout 
adequately  reflecting  the  processes  of  information  processing  in  vector 
computers  of  all  types  (both  pipeline  and  array  types)  and  enabling  effective 
realization  in  vector  computers  with  programmable  configuration  of  the 
processing  pipeline. 

2.  Investigations  in  the  area  of  programming  languages  for  vector  computers, 
including  development  of  constructs  of  high-level  programming  language  to 
ensure  maximum  effective  use  of  vector  computers  with  programmable  configura¬ 
tion  of  the  processing  pipeline. 

3 .  Development  of  a  technology  for  creation  of  translators  and  verification  of 
same  during  realization  of  the  VP  programming  system. 

4.  Practical  realization  of  a  program  development  complex  in  the  BESM-6  in^ 
order  to  carry  out  experiments  and  provide  the  potential  user  with  a  function¬ 
ally  complete  means  of  development  and  debugging  of  application  programs 
designed  for  solution  in  the  VP  of  the  Elbrus  MCC. 

1.  Layout  of  the  Programs,  Programming  Language,  and  Translator  Design 
Technology 

1.1.  The  Underlying  Program  Layout 

Almost  all  existing  languages,  including  graphic  means  of  processing  of 
arrays ,  as  well  as  programming  languages  of  vector  computers ,  belong  to  the 
languages  with  consecutive  operator  layout  of  the  programs,  i.e.,  they  have  a 
layout  similar  to  the  traditional  scalar  processing  languaps  (ALGOL,  FORTRAN, 
etc.).  The  hallmark  of  all  the  above  languages  is  the  definition  of  the 
semantics  for  processing  of  arrays.  In  fact,  we  propose  to  divide  the  entire 
collection  of  programming  languages  with  means  of  processing  arrays  into  three 
groups  according  to  the  method  of  definition  of  the  vector  processing^ 
semantics  (more  precisely,  the  method  of  definition  of  the  vector  assignment 
semantics)  and,  thus,  to  speak  of  three  program  layouts  underlying  the 
programming  languages  with  array  processing  means.  Apart  from  this,  all  three 
program  layouts  have  an  identical  consecutive  semantics  of  execution  of 
operators,  i.e.,  the  following  operator  will  be  executed  after  completion  of 
execution  of  the  previous  operator. 

The  first  group  of  languages  interprets  the  assignment  operator,  defined  on 
arrays ,  as  a  compact  notation  for  the  operator  of  the  cycle .  An  example  of 
the  first  group  of  languages  is  PL/I. 

The  second  group  of  languages  prescribes  evaluating  the  expression  on  the 
right  side  of  the  assignment  operator,  followed  by  transferring  the  values  of 
the  resulting  array  of  such  computation  to  the  array  defined  by  the  left  side 
of  the  assignment  operator.  This  most  extensive  group  of  languages  includes, 
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in  particular,  ALGOL- 68  and  the  majority  of  programming  languages  of  vector 
computers . 

In  the  third  group  we  shall  place  languages  in  which  the  vector  assignment 
operator  is  a  compact  notation  for  a  set  of  elementary  assignment  operators 
(EAOs),  defined  on  the  like  components  of  vectors,  and  the  execution  of  a 
vector  assignment  operator  is  defined  as  an  asynchronous  parallel  execution  of 
the  EAOs  which  comprise  it.  Such  semantics  is  generalized  in  natural  manner 
to  an  assignment  operator  of  array  type  (AOAT) ,  when  multidimensional  arrays 
are  used  on  the  left  and  right  sides  of  the  operator. 

The  third  layout,  i.e.,  a  consecutive  operator  scheme  of  programs  with 
parallel  vector  assignment  operator,  was  chosen  as  the  basis  for  design  of  the 
underlying  programming  language  of  the  VP.  This  scheme  adequately  reflects 
the  process  of  information  processing  in  all  kinds  of  vector  computers  and  can 
be  recommended  as  the  basic  scheme  for  designing  the  programming  languages  of 
vector  computers . 

1.2.  The  Underlying  Programming  Language 

The  underlying  high-level  programming  language  was  developed  to  allow  maximum 
effective  utilization  of  the  VP  with  programmable  configuration  of  the 
processing  pipeline.  A  survey  of  the  basic  constructs  of  this  language  is 
contained  in  [4].  However,  several  modifications  were  made  in  the  final 
version  of  the  underlying  language  for  description  of  arrays  and  subarrays. 

We  therefore  present  examples  of  the  descriptions  of  static  and  dynamic 
subarrays.  Say  B  is  a  three-dimensional  array.  The  description  of  the 
subarray: 


/H2/  Af3,0  = 

then  defines  a  two-dimensional  subarray  A  of  the  reference  array  B.  In  the 
array  B,  the  fifth  slice  is  taken  for  the  first  measurement.  In  the  resulting 
two-dimensional  array,  the  first  and  last  columns  are  removed  and  the 
resulting  array  is  transposed.  The  resulting  array  is  the  subarray  A.  Of 
course,  no  such  displacements  of  the  elements  of  array  B  occur.  The  attri¬ 
butes  of  array  A  are  simply  formulated  in  such  manner  as  to  be  accessible  to 
the  required  subarray.  An  example  of  the  description  of  a  dynamic  subarray  C 
is : 


'/m/  c[2,0(i)  . 

Here,  1  is  an  associated  variable  (integer).  The  subarray  C  is  a  transposed 
two-dimensional  one-th  slice  of  the  array  B  in  the  third  measurement. 

The  underlying  language  allows  effective  realization  of  convolution  type 
operations.  For  example,  the  scalar  product  of  vectors  A  and  B  with  the 
result  placed  in  scalar  X  can  be  written  as: 

CONVOLUTION  (X:  =  +  [A*B]). 

For  the  CONVOLUTION  operator,  the  translator  generates  a  code  that  will  be 
used  to  construct  the  configuration  of  the  functional  devices  with  feedback 
connections,  enabling  realization  of  an  accumulation  mode. 
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The  underlying  language  has  a  standard  mechanism  of  macrosubstitutions,  the 
macrodefinitions  being  taken  from  either  the  macrodefinition  partition  of  the 
module  or  the  library  of  macrobased  crosses. 

1.3.  The  Technology  for  Design  of  Translators 

In  the  design  of  the  VP  software,  major  attention  was  given  to  developing  a 
technology  of  translator  creation  enabling  substantial  reduction  in  the 
implementation  time.  The  objective  was  a  very  quick  construction  of  a  fully 
operational  translator,  perhaps  with  some  reduction  in  the  speed  of  trans¬ 
lation,  but  without  impairing  the  quality  of  the  generated  object  code.  If 
the  translator  has  built  in  means  of  automatic  gathering  of  statistical  data 
to  allow  a  determination  of  the  effectiveness  of  execution  of  the  resulting 
program  and  the  effectiveness  of  the  translation  algorithms,  it  may  be  used  in 
the  early  stage  of  design  of  the  computing  system. 

Indeed,  analysis  of  statistical  data  assembled  during  the  translation  of  a 
rather  representative  collection  of  programs  makes  it  possible  to  identify 
"bottlenecks"  in  the  architecture.  If  the  bottlenecks  are  discovered  early 
enough,  then  the  hardware  can  be  modified,  the  translator  modified,  and  the 
collection  of  statistics  resumed  with  the  same  set  of  actual  programs. 
Comparison  of  the  results  of  experiments  makes  it  possible  to  assess  the 
effectiveness  of  the  changes  introduced  and  select  the  optimal  version  of  the 
architecture.  Furthermore,  the  means  of  gathering  statistical  data  that  are 
built  into  the  translator  allow  an  investigation  of  the  process  of  translation 
and  an  optimization  of  the  translation  algorithms  in  the  same  set  of  actual 
programs . 

Such  approach  is  effective  even  in  the  design  of  computers  with  traditional 
architecture  and  conventional  system  of  instructions,  since  the  overall 
effectiveness  of  the  system  is  optimized  in  an  early  phase.  For  computers 
with  a  set  of  pipeline  devices,  with  programmable  switching,  and  with  "very 
long  instruction  word"  (VLIW  systems),  such  approach  is  evidently  the  only  one 
possible,  since  for  VLIW  systems  the  translator  should  be  designed,  if  not  in 
advance  of  the  hardware,  then  at  least  simultaneous  with  it. 

The  following  technology  for  development  of  translators  is  proposed  to  support 
such  approach  to  the  design  of  computing  systems. 

1.  Create  an  experimental  parametrized  translator  with  built-in  means  of 
automatic  gathering  of  statistical  data. 

2.  Gather  statistics  in  application  program  packages  and  analyze  the  results. 

3.  Feedback  to  the  design  of  the  hardware  (possible  partial  redesign  on  the 
basis  of  the  analysis  findings). 

4.  Feedback  to  creation  of  the  experimental  translator  (variation  of 
parameters;  translator  version  for  the  improved  hardware;  optimization  of 
translation  algorithms  on  the  basis  of  the  analysis  findings) . 

5.  Create  a  commercial  version  of  the  translator. 
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In  order  for  this  technology  to  be  viable,  quick  implementation  and  easy 
modification  of  the  experimental  translator  must  be  secured.  This,  in  turn, 
requires  creation  of  a  set  of  tools,  as  well  as  selection  (or  creation)  of  a 
utility  language  capable  of  meeting  the  above  requirements.  As  the  utility 
language,  we  selected  REFAL/2,  implemented  in  many  Soviet  computers  and  used 
in  the  creation  of  translators.  The  writing  of  translators  in  REFAL  makes 
them  mobile,  so  that  the  technology  is  independent  of  the  specific  development 
computer.  Since  REFAL/2  was  absent  from  the  Elbrus  MCC,  in  the  course  of  the 
work  on  the  VP  software  a  realization  of  REFAL  for  the  Elbrus  MCC  was  imple¬ 
mented  [  7  ]  . 

The  proposed  design  technology  was  tested  in  the  development  of  the  VP.  We 
may  note  the  factors  with  negative  impact  on  the  duration  of  the  design 
process  and  realization  of  the  underlying  language  translator . 

1.  The  start  of  design  of  the  translator  was  delayed,  since  the  underlying 
language  was  being  developed  along  with  the  VP. 

2.  No  experience  in  design  of  optimizing  translators  for  VLIW  systems. 

3.  No  utilities  available  at  the  start  of  the  project. 

4.  No  sufficiently  representative  applied  program  package  in  the  underlying 
language  of  the  VP  necessitated  development  of  a  converter  of  FORTRAN  into  the 
underlying  language. 

Use  of  REFAL  as  the  development  language  made  it  possible  not  only  to  shorten 
the  time  of  implementation  of  the  translator,  but  also  reduce  the  translator 
design  time.  Under  one-  year  was  spent  on  the  design  and  implementation  of  the 
translator  of  the  underlying  language.  The  translator  generates  optimal  code 
for  the  vector  segments  of  the  program  and  carries  out  a  far-reaching 
optimization  during  translation  of  scalar  linear  segments. 

Analysis  of  statistical  data  obtained  in  the  experimental  translator  made  it 
possible  to  identify  certain  bottlenecks  in  the  VP  architecture  at  such  design 
stage  when  partial  redesign  was  comparatively  painless.  Changes  were  made  in 
the  hardware  of  the  VP,  and  the  experiments  to  optimize  the  translation 
algorithms  were  continued  in  the  improved  (with  modified  hardware)  translator. 
Not  all  of  the  discovered  bottlenecks  were  corrected,  but  a  compromise  was 
achieved,  in  which  further  hardware  expenses  (for  the  particular  componentry) 
are  commensurate  with  the  anticipated  outcome  of  the  improvements.  In  any 
case,  the  hardware  designers  were  fully  aware  in  adopting  these  compromise 
decisions . 

The  experience  with  the  design  of  the  VP  confirmed  the  viability  of  the 
proposed  technology  for  design  of  translators.  In  addition,  there  is  every 
reason  to  expect  that  the  period  of  implementation  of  an  experimental  trans¬ 
lator  for  VLIW  systems  can  be  further  reduced,  since  experience  is  now 
available  and  a  set  of  utilities  has  been  created. 

Let  us  note  that  such  technology  in  no  way  obviates  such  reliable  design  tool 
as  the  model  investigation.  On  the  contrary,  the  gathering  and  analysis  of 
statistics  complement  the  model  investigation  of  the  operation  of  the 
processor  in  solving  test  problems.  This  method  is  perhaps  less  exact  than  a 
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model  investigation,  but  it  produces  a  good  picture  (beyond  the  power  of  a 
model  investigation)  of  the  presence  of  bottlenecks  in  the  architecture  of  the 
processor  (in  actual  programs)  and  is  much  more  powerful  than  modeling  in  the 
quantity  of  programs  analyzed. 

We  emphasize  once  more  that  the  proposed  technology  is  especially  effective  in 
design  of  VLIW  systems. 

2.  Program  Development  Utility  Set 

The  VP  development  system  includes  the  following  components: 

-  the  translator  of  the  underlying  language; 

-  the  translator  of  the  machine  code  of  the  VP; 

-  a  compiler  of  object  modules; 

-  an  interpreter  of  the  instruction  system; 

-  an  interactive  debugger; 

-  utility  programs. 

2.1.  The  Underlying  Language  Translator 

The  underlying  language  translator  in  processing  the  descriptions  of  arrays 
and  subarrays  generates  instructions  that  form  the  values  of  the  attributes. 
Only  a  single  register  of  the  index  register  file  (IRF)  is  assigned  by  the 
translator  for  statically  equivalent  attributes.  For  subarrays,  instructions 
are  generated  to  check  when  the  boundary  of  the  reference  array  is  crossed,  if 
such  static  check  is  not  possible.  If  the  name  of  the  dynamic  subarray  is 
found  in  the  partition  of  operators,  a  special  instruction  is  generated  to 
check  whether  the  value  of  the  associated  variable  belongs  to  the  permitted 
range.  Instructions  to  recompute  the  address  of  the  beginning  of  the  dynamic 
subarray  are  generated  only  if  it  is  possible  to  change  the  value  of  the 
associated  variable  after  the  last  recomputation.  For  AOAT,  in  addition  to 
address  instructions  and  partial  reconfiguration  instructions,  the  translator 
generates  an  instruction  to  load  the  length  of  the  vector  and  an  instruction 
to  load  the  time  diagrams  of  the  startup  and  slowdown  phase,  on  the  basis  of 
which  the  VP  hardware  accomplishes  a  dynamic  integration  of  consecutive 
operators  [2]. 

The  single  algorithm  for  translation  of  AOAT,  DOUBLET  and  FORK  constructs 
includes  the  following  main  phases. 

1.  Syntax  check  and  construction  of  a  representation  of  the  computational  grid 
for  the  given  AOAT. 

2.  Processing  of  the  resulting  representation  of  the  grid  with  a  wave 
algorithm.  The  relative  moments  of  time  (in  clock  periods)  for  all  nodes  of 
the  grid  are  determined  and  the  contours  (if  present)  are  checked  for  balance 
[4]. 
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3.  Modification  of  the  grid  to  ensure  the  necessary  separation  of  scalars 
(which  may  occur  on  the  right  side  of  the  AOAT)  in  different  sections  of  the 
real  register  file  (RRF)  and  enable  a  relative  shift  of  nonconnected  subtrees 
(for  the  DOUBLET  construct)  which  appear  in  the  grid. 

4.  Appointment  of  the  physical  nxjunbers  of  the  devices. 

5.  Generation  of  instructions. 

A  general  case  of  scalar  processing  may  be  presented  in  the  form  of  computa¬ 
tions  along  a  linear  sequence  of  assignment  operators  of  scalar  type  (AOST) , 
not  considering  control  transfer  operators,  the  influence  of  which  on  the 
efficiency  of  the  computations  should  be  considered  separately.  The  AOST  may 
include  both  scalar  variables  and  constants  located  in  the  RRF  and  elements  of 
arrays  contained  in  the  RAM  of  the  VP. 

The  underlying  language  translator  performs,  for  the  linear  segment  of  scalar 
processing,  a  planning  of  the  sequence  of  switching  of  the  functional  devices, 
the  sections  of  the  RRF,  and  the  memory  exchange  flows,  taking  into  account 
the  delays  of  the  devices  and  the  following  limitations:  in  each  clock 
period,  only  one  pair  of  operands  may  arrive  at  any  given  functional  device, 
and  each  section  of  the  RRF  may  receive/send  only  one  number.  Further 
limitations  arise  from  the  information  dependents  among  different  AOSTs,  i.e., 
the  need  to  solve  the  problems  of  read  after  write  (RAW) ,  write  after  read 
(WAR) ,  and  write  after  write  (WAW) . 

The  translation  of  each  AOST  is  done  as  follows.  First,  a  representation  of 
the  tree  of  computations  is  constructed,  corresponding  to  the  right  side  of 
the  AOST.  A  correspondence  is  established  between  the  operations  (internal 
nodes  of  the  tree)  and  the  functional  devices  which  realize  them.  The  index 
expressions  are  translated  into  a  sequence  of  instructions,  which  load  the 
working  registers  of  the  IRF  with  the  addresses  of  the  required  elements  of 
the  arrays.  A  syntax  check  is  made  and  the  addresses  of  the  variables  and 
constants  appearing  in  the  AOST  are  taken  from  the  table  of  names  of  the 
translator. 

Next,  a  recursive  circuit  of  the  resulting  tree  is  made  and  the  sequence  of 
switching  of  the  VP  devices  to  accomplish  the  computations  efficiently  for  the 
given  tree  is  determined.  While  moving  from  the  crown  to  the  root  of  the 
tree,  the  possibility  of  a  no-conflict  (with  respect  to  computing  resources) 
integration  of  two  previously  processed  subtree -operands  is  analyzed  for  the 
next  node  when  the  operands  are  simultaneously  presented  at  the  inputs  of  the 
particular  functional  device.  For  the  subtrees  that  have  been  processed,  time 
diagrams  of  the  occupied  resources  have  already  been  constructed  in  the  form 
of  a  linear  sequence  of  binary  scales  (one  scale  corresponds  to  one  clock 
period  of  the  VP) .  The  position  code  in  the  scale  represents  each  of  the 
available  devices  that  is  busy  during  the  particular  clock  period.  Comparing 
the  time  diagrams  period  by  period  makes  it  possible  to  analyze  the  possi¬ 
bility  of  a  combination  of  subtrees. 

If  there  is  a  conflict  in  resources,  an  attempt  is  first  made  to  load  the 
lower  subtree  for  four  clock  periods  (the  amount  of  the  delay  to  write  the 
intermediate  result  in  the  RRF  and  read  it  immediately  into  the  configura¬ 
tion)  .  If  the  loading  for  four  clock  periods  is  prevented  by  resource 
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conflict,  the  lower  subtree  is  moved  away  from  the  higher  one,  first  by  four 
clock  periods  and  then  by  another  clock  period  each  time  until  the  conflict 
disappears.  The  intermediate  result  corresponding  to  the  root  of  the  higher 
subtree  will  be  entered  into  the  RRF  register  and  then  read  from  it.  In 
certain  cases,  a  delay  line  is  used  to  save  the  intermediate  result.  If  the 
next  node  of  the  tree  corresponds  to  a  unary  operation,  no  sorting  through 
alternatives  is  required. 

For  an  individual  AOST,  the  translator  constructs  the  following  structures: 

-  a  list  of  appearances  of  the  scalars  used,  in  which  the  minimum  and  maximum 
reading  time  are  specified  for  each  scalar  used  on  the  right  side  of  the  AOST; 

-  a  time-ordered  list  of  appearance  of  index  variables,  giving  for  each 
appearance  the  number  of  the  flow,  the  address  for  the  IRF  register  in  which 
the  address  for  the  RAM  of  the  VP  is  formed,  and  the  time  of  access  to  the 
flow  (number  of  clock  period) ; 

-  a  list  of  partial  reconfiguration  instructions  for  the  processed  tree  of 
computations; 

-  the  height  of  the  processed  tree  in  clock  periods; 

-  the  time  diagram  of  occupation  of  devices  in  the  above -described  form. 

Translation  of  the  sequence  of  AOSTs  is  done  in  recursive  manner  in  two 
stages:  translation  of  the  next  AOST  and  maximum  possible  "loading"  of  the 

program  for  the  current  AOST  into  the  program  performing  the  computation  of 
the  previous  AOSTs.  Maximum  loading  is  prevented  both  by  possible  resource 
conflicts  and  by  information  dependences.  In  general,  for  a  language  with 
high  degree  of  parametrization,  especially  the  underlying  language  of  the  VP, 
the  memory  addresses  for  the  elements  of  the  array  are  statically  unknown, 
which  enforces  a  rigorous  observance  of  the  sequence  of  exchanges  with  memory 
between  the  various  AOSTs. 

For  the  processed  portion  of  the  linear  sequence  of  AOSTs,  in  addition  to  the 
structures  indicated  for  the  AOSTs  (in  the  present  case  referring  to  the 
entire  processed  portion  of  the  sequence  of  operators)  the  translator  has  a 
list  specifying  the  appearances  of  the  scalar  variables,  in  which  the  address 
in  the  RRF  and  the  maximum  writing  time  at  this  address  are  given  for  each 
scalar  used  on  the  left  side  of  the  operator. 

After  processing  the  current  AOST,  the  translator  loads  it  in  the  already 
processed  sequence  of  AOSTs  as  far  down  as  permitted  by  the  possible  informa¬ 
tion  relationships.  If  the  time  diagrams  conflict  in  resources,  the  current 
AOST  is  "floated  up"  period  by  period  until  the  conflict  vanishes.  The 
algorithm  then  merges  the  described  structures  and  moves  on  to  process  the 
next  operator. 

After  processing  the  entire  sequence  of  AOSTs,  the  instructions  are  generated. 
The  translator  forms  the  sequence  of  instructions  of  the  linear  segment  of  the 
AOST  in  the  same  way  as  a  vector  macrobased  cross,  and  the  hardware  of  the  VP 
accomplishes  the  maximum  possible  dynamic  integration  of  consecutive 
macrobased  crosses  [2]. 
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2.2.  The  Compiler  of  Object  Modules 

The  object  modules  obtained  as  a  result  of  translation  from  the  underlying 
language  or  the  machine  code  of  the  VP  are  further  processed  by  the  compiler. 
On  the  basis  of  information  contained  in  the  object  modules,  the  compiler 
assembles  several  separately  translated  modules  into  a  single  module.  The 
modules  needed  for  the  compiler  to  work  are  taken  from  the  archive  of  object 
modules.  The  result  of  the  work  is  placed  in  the  archive  under  a  title  given 
by  the  user.  The  names  of  the  head  module  and  the  result  module  are  given  to 
the  compiler  as  parameters.  In  addition,  the  user  may  specify  a  filter, 
indicating  the  names  of  modules  inaccessible  to  the  compiler.  In  this  case, 
the  indicated  modules  will  not  be  included  in  the  resulting  compilation. 

The  process  of  compilation  is  broken  up  into  two  phases.  In  the  first  phase, 
the  junction  interface  between  program  modules  is  checked.  For  this,  a  table 
of  names  of  the  called  compiler  modules  is  constructed,  bringing  together  all 
names  of  modules  appearing  in  the  compilation  with  an  indication  of  the 
relative  address  of  the  particular  module  in  the  result  module,  and  associated 
information  as  to  the  structure  of  the  parameters  of  the  modules  appearing  in 
the  compilation.  If  incorrect  module  calls  are  found,  the  compiler  sends 
diagnostic  messages  about  this..  At  the  end  of  the  first  phase,  the  compiler 
composes  a  list  of  names  of  modules  appearing  in  the  compilation  and  their 
relative  loading  addresses  and  sends  it  to  be  printed. 

Transition  to  the  second  phase  of  compilation  occurs  only  if  all  program 
module  calls  are  correct.  In  the  second  phase,  the  resulting  object  module  is 
composed.  The  call  instructions  are  finally  composed  in  the  code  field  and 
the  other  fields  of  the  object  module  are  composed.  The  resulting  compilation 
contains  all  information  as  to  the  modules  appearing  therein  needed  for 
debugging. 

If  all  called  object  modules  are  present  in  the  archive  and  available  to  the 
compiler,  the  outcome  of  the  full  compilation  will  be  a  self-contained  program 
complex  that  can  be  sent  on  for  execution.  But  if  certain  modules  are  absent 
from  the  archive  or  not  available  to  the  compiler,  a  partly  compiled  module 
will  be  created,  containing  the  object  codes  of  the  accessible  modules,  while 
the  names  of  the  unavailable  modules  will  be  entered  into  a  corresponding 
table  of  the  resulting  module. 

The  compiler  has  a  broad  array  of  diagnostic  messages,  allowing  the  user  to 
obtain  objective  information  as  to  the  course  of  the  compilation,  as  well  as 
certain  data  on  the  resulting  module  obtained. 

2.3.  Debugging  Features 

The  development  system  of  the  VP  includes  debugging  features  in  the  form  of  an 
interpreter  of  the  instruction  system  and  an  interactive  debugger  in  the  terms 
of  the  underlying  language.  These  are  intended  for  debugging,  investigation 
of  the  structure  and  course  of  program  execution  for  the  purpose  of  subsequent 
optimization. 

The  interpreter  of  the  instruction  system  simulates  the  VP  hardware  and 
imitates  the  computation  of  the  user  program  in  the  VP  by  the  resources  of  the 
development  computer.  The  main  requirements  on  the  development  of  the 
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interpreter  were:  minimization  of  the  program  execution  time  for  the  VP, 
computational  results  consistent  with  the  Actual  ones,  and  a  means  of  program 
debugging  and  profiling. 

The  interpreter  is  built  as  a  collection  of  subprograms  which  simulate  the 
operation  of  the  following  devices  of  the  VP: 

-  the  instruction  buffer; 

-  the  index  processor  with  IRF; 

-  the  exchange  devices  between  the  RAM  of  the  VP  and  the  RAM  of  the  MCC; 

-  six  input  and  two  output  flows; 

-  six  functional  devices  and  four  sections  of  RRF; 

-  the  configurator. 

The  operating  algorithms  for  the  modules  of  the  functional  devices  reflect  the 
operation  of  the  actual  equipment.  The  modules  of  the  devices  making  up  a 
configuration  work  with  the  appropriate  time  delays.  For  the  divider,  this  is 
16  clock  periods ,  for  the  adders  and  multipliers  it  is  8  periods ,  for  the  RRF 
3  periods.  The  interpreter  simulates  only  the  configuration  element,  while 
the  other  devices  of  the  VP  are  interpreted.  The  Vt  RAM  and  the  MCC  RAM  are 
simulated  on  external  magnetic  media  in  the  actual  volumes . 

The  interpreter  works  closely  with  the  interactive  debugger.  This  is  designed 
to  debug  user  programs  and  system  programs  of  the  VP.  The  debugging  is  done 
in  interactive  mode  in  the  terms  of  the  underlying  programming  language  of  the 
VP. 

Only  self-contained  (i.e.,  obtained  by  full  compilation)  program  modules 
belong  to  the  debugger.  When  running  programs  in  the  interpreter,  the 
debugger  searches  and  retrieves  modules  in  the  system  archive. 

The  debugger  makes  it  possible  to  establish  and  remove  check  points  by  name  of 
label  or  by  number  of  macrobased  cross  (simple,  multiple,  multiple  with  a  set 
limit);  to  initialize  the  simulation  of  the  program  being  debugged;  to  examine 
the  values  of  the  variables  (in  decimal  or  octal  form)  and  their  attributes; 
to  examine  and  change  the  values  of  certain  registers  of  the  VP;  to  obtain 
information  about  the  modules  making  up  the  user  archive;  and  to  keep  a  record 
of  the  debugging. 

The  possibility  is  provided  to  gather  statistical  information  on  the  programs 
as  they  are  interpreted.  The  program  profiling  features  built  into  the 
interpreter  gather  information  on  the  run  time  of  the  program  and  its 
individual  parts  with  the  purpose  of  an  optimization,  and  also  obtain  other 
dynamic  statistics  regarding  the  user  program.  The  debugger  processes  this 
information  and  puts  it  out  to  a  terminal.  An  important  feature  of  the 
execution  of  programs  in  the  interpreter  is  the  possibility  of  a  debugging  one 
clock  period  at  a  time.  As  the  program  is  executed,  diagnostic  and  debugging 
information  appears  on  the  screen;  the  user  may  block  the  appearance  of  this 
information  by  an  appropriate  adjustment.  The  speed  of  the  interpreter  when 
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processing  instructions  of  the  index  processor  is  1000  clock  periods  per 
second^  when  processing  configuration  instructions  it  is  400  periods  per 
second. 

3*  Results  of  Experiments 

Performance  of  experiments  in  the  development  system  of  the  VP  was  an  integral 
part  of  the  design  process  of  the  VP,  having  the  objective  of  discovering 
bottlenecks  in  the  VP  architecture,  optimization  of  the  translation 
algorithms,  and  determination  of  the  efficiency  of  using  the  VP  in  vector  and 
in  scalar  processing. 

The  possibility  of  maximum  efficiency  of  the  VP  during  vector  processing  was 
established  at  a  relatively  early  stage  of  the  project  [5].  However,  it  was 
only  possible  to  establish  (with  high  degree  of  reliability)  the  effectiveness 
of  the  VP  in  scalar  processing  during  the  course  of  an  experiment  in  the 
development  system,  making  use  of  a  sufficiently  representative  application 
program  package.  For  this  reason,  the  main  attention  in  the  experiments  was 
concentrated  on  scalar  processing. 

In  order  to  assess  the  effectiveness  of  scalar  processing,  the  translator  of 
the  underlying  language  has  built-in  features  for  gathering  of  statistics  on  a 
number  of  criteria,  enabling  a  determination  of  the  average  speed  (clock 
periods  per  operation) ,  the  relative  frequency  of  possible  sources  of  conflict 
in  resources  during  attempts  to  establish  a  static  parallel  processing  of  a 
problem,  the  relative  frequency  of  occurrence  of  a  conflict  in  different 
situations,  the  average  displacements  when  joining  two  subtree -operands 
together  and  when  joining  the  next  AOST  to  the  previous  ones.  The  parameters 
gathered  by  the  translator  are  as  follows: 

-  the  total  number  of  jointures  of  subtree-leaves; 

-  the  total  number  of  nonconflict  jointures  of  leaves; 

-  the  total  number  of  jointures  of  nontrivial  subtrees; 

-  the  number  of  nonconflict  jointures  of  nontrivial  subtrees; 

-  the  number  of  unsuccessful  attempts  to  join  nontrivial  subtrees  with  the 
smaller  subtree  loaded  for  four  clock  periods ; 

-  the  total  of  resultant  upward  displacements  of  the  smaller  subtree  relative 
to  the  larger  one  when  joining  nontrivial  subtrees; 

-  the  total  of  resulting  displacements  of  an  AOST  with  respect  to  previous 
ones; 

-  the  total  of  minimum  boundaries  dictated  by  the  information  dependence  of 
displacements  of  the  AOST  relative  to  the  previous  ones; 

-  the  total  number  of  jointures  of  an  AOST  to  the  previous  ones; 

-  the  total  number  of  operations  in  all  AOSTs ; 
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-  the  overall  time  (in  clock  periods)  for  the  groups  of  partial  reconfiguraT 
tion  instructions  corresponding  to  all  AOSTs; 

-  the  total  number  of  resource  conflicts  recorded  (both  in  processing  the  AOST 
and  in  joining  the  AOST  to  the  previous  ones)  individually  for  each  resource 
(incoming  flows,  outgoing  flows,  functional  devices,  sections  of  the  RRF) . 

Statistics  were  gathered  in  two  groups  of  tests. 

The  first  small  group  of  seven  tests  (565  scalar  floating-point  operations) 
contained  programs  for  evaluation  of  an  elementary  function,  programs  for 
solving  differential  equations,  and  several  fragments  of  user  programs.  This 
group  of  tests  was  characterized  by  being  replete  with  scalar  computations  of 
real -valued  type;. 

In  order  to  gather  statistics  on  a  more  representative  class  of  problems,  a 
limited  converter  from  FORTRAN  to  the  underlying  programming  language  of  the 
VP  was  developed,  adequately  converting  the  scalar  processing  segments. 

The  second  group  of  96  tests  (3322  floating-point  operations)  consisted  of 
subroutines  converted  from  FORTRAN,  included  in  the  LIBRARY  1,2  of  standard 
programs  of  the  Dubna  monitor  system  of  the  BESM-6.  Programs  were  taken  from 
different  partitions  of  these  libraries  for  the  testing: 

-  polynomials  and  special  functions; 

-  operations  on  functions  and  solving  of  differential  and  integral  equations; 

-  interpolation  and  approximation; 

-  solving  of  systems  of  linear  equations; 

-  nuclear  physics  and  chemistry; 

-  problems  of  optimization,  model  investigation,  and  control  methods; 

-  high  energy  physics. 

The  conversion  was  done  (as  is  possible  in  certain  cases)  without  vector iza- 
tion  and  involved  only  the  real-value  processing  segments,  including  simple 
transfer  (assignment) .  It  was  observed  that  such  transfers  in  the  programs  of 
the  second  group  occupy  a  considerable  portion  of  the  real -value  processing, 
which  is  the  reason  for  the  lower  speed  (in  clock  periods  per  operation)  for 
the  second  group  of  tests.  Furthermore,  the  programs  of  the  second  group  were 
written  for  the  user  computer,  and  they  abound  in  branch  instructions,  so  that 
the  average  length  of  continuous  (linear)  segments  is  not  large. 

We  present  the  averaged  statistical  parameters  (giving  in  brackets  the  results 
for  the  less -representative  first  group  of  tests). 

1.  The  percentage  of  nonconflict  combinations  of  subtree -operands  in  an  AOST 
without  relative  displacement  is  90.7  percent  (90.5  percent). 
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2.  The  percentage  of  nonconflict  combinations  of  subtrees  that  are  not 
terminal  nodes  in  an  AOST  without  displacement  is  66.2  percent  (70.2  percent), 

3.  The  percentage  of  nonconflict  combinations  of  a  subtree  with  a  terminal 
node  in  an  AOST  without  displacement  is  93.4  percent  (96.3  percent). 

4.  The  average  displacement  of  the  smaller  subtree  relative  to  the  larger  one 
in  an  AOST  is  0.37  (0.01)  clock  periods  (loading  for  four  clock  periods  is 
taken  with  negative  sign) . 

5.  The  average  minimum  shifting  of  the  current  AOST  relative  to  the  preceding 
ones  as  a  result  of  information  dependency  is  14.3  (43.4)  clock  periods. 

6.  The  average  shifting  of  an  AOST  relative  to  the  preceding  ones  with 
allowance  for  resolution  of  resource  conflicts  and  information  dependencies  is 
14.7  (48.8)  clock  periods . 

7.  Statistics  on  the  causes  of  resource  conflicts: 

-  memory  exchange  flows  10.8  percent  (31.8  percent); 

-  divider  -  0.5  percent  (1.8  percent); 

-  RRF  sections  -  78.9  percent  (43.2  percent); 

-  logic  device  -  2.9  percent  (0.1  percent); 

-  adders  SO,  SI  -  2.3  percent  (0.6  percent); 

-  multipliers  UO,  U1  -  4.5  percent  (22.5  percent); 

8.  Average  speed  is  8.8  (3.8)  clock  periods  per  operation. 

For  the  individual  AOST,  the  best  indicator  of  the  speed  in  the  tests  is 
2.2  clock  periods  per  operation  (corresponding  to  a  speed  of  9  MFLOPS) .  For  a 
sequence  of  AOSTs,  the  best  speed  in  specially  selected  tests  was  0.73  clock 
periods  per  operation  (29  MFLOPS) . 

On  the  basis  of  the  obtained  statistical  data,  the  following  conclusions  can 
be  drawn  with  respect  to  the  efficiency  of  the  VP  in  scalar  computations. 

1.  The  most  critical  resource  is  the  RRF  sections. 

2.  The  high  percentage  of  causes  of  data  flow  conflicts  for  the  first  group  of 
tests  is  not  characteristic  of  the  general  case  and  is  explained  by  the 
presence  in  this  group  of  a  test  that  is  a  long  linear  segment,  comprising 
operators  of  the  form: 

<SCALAR>  :  -  <EXPRESSION  PROCESSING  ELEMENTS  OF  ARRAYS>. 

For  such  segment  between  different  AOSTs,  there  are  no  information  dependences 
if  the  scalars  on  the  left  side  are  different,  and  in  the  absence  of 
constraints  the  planning  algorithm  performs  a  considerable  number  of  searches 
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among  alternatives.  The  speed  for  this  test  (143  operations)  amounted  to 
1.1  clock  periods  per  operation. 

3.  The  most  serious  impediment  in  combination  of  the  execution  of  different 
AOSTs  making  up  a  linear  segment  is  the  information  dependency. 

4.  The  speed  of  the  VP  in  scalar  processing  is  roughly  an  order  of  magnitude 
lower  than  the  speed  in  vector  processing. 

As  a  result  of  the  experiments  and  analysis  of  statistics  in  realistic 
programs,  the  translator  was  modified  to  improve  the  quality  of  the  code 
generated.  Thus,  the  workload  of  the  devices  SO,  SI,  UO,  U1  became  more 
evenly  balanced  as  a  result  of  the  modifications,  and  the  information 
dependencies  were  resolved  more  accurately.  The  need  for  further  modification 
of  the  scalar  processing  translation  algorithm  to  achieve  a  more  evenly 
balanced  workload  of  the  RRF  sections  was  identified. 

Conclusion 

Let  us  list  the  main  results  obtained  in  the  development  of  the  software  of 
the  VP  of  the  Elbrus  MCC. 

1.  A  sequential  operator  scheme  of  programs  with  parallel  vector  assignment 
operator  was  proposed  as  the  conceptual  foundation  for  the  design  of  program¬ 
ming  languages  for  vector  computers. 

2.  On  the  basis  of  the  proposed  layout  of  programs,  a  high-level  underlying 
programming  language  was  developed  for  the  VP  of  the  Elbrus  MCC ,  containing 
sophisticated  means  of  description  of  subarrays  and  allowing  maximum  efficient 
utilization  of  the  VP  with  programmable  configuration  of  the  processing 
pipeline . 

3.  A  technology  has  been  developed  for  construction  of  translators  that  is 
especially  efficient  in  the  design  of  pipeline  computing  systems  with  program¬ 
mable  configuration  of  the  processing  pipeline  and  with  very  long  instruction 
word  (VLIW  systems).  This  technology  has  been  tested  out  in  the  design  of  the 
VP. 

4.  As  part  of  the  project,  a  version  of  the  language  REFAL/2  for  the  Elbrus 
MCC  was  created. 

5.  A  VP  development  system  has  been  realized  in  the  BESM-6,  operating  under 
control  of  the  OS  DISPAK,  in  the  presence  of  the  ARFA  archive  system.  In 
addition,  translators  of  the  underlying  language  and  the  VP  machine  code  have 
been  implemented  in  the  Elbrus  MCC, 

6.  Experiments  conducted  in  the  VP  development  system  have  produced  a  qualita¬ 
tive  understanding  of  the  bottlenecks  in  the  architecture  of  the  VP  and  made 
it  possible  to  establish  the  effectiveness  of  using  the  VP  in  both  vector  and 
scalar  processing.  Furthermore,  the  experiments  enabled  an  optimization  of 
the  translation  algorithms  both  with  respect  to  translation  time  and  quality 
of  generated  code. 
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1 

[Article  by  N.  S.  Fetisov,  M.  V.  Tverdokhlebov,  A.  A.  Sablukov,  and  A.  N. 
Kramarenko ] 

[Text]  The  expanded  functional  capabilities  and,  thus,  more  complicated 
hardware  of  computers,  computing  complexes  and  systems,  and  the  increased 
demands  on  the  operating  reliability  of  the  systems,  give  rise  to  the 
necessity  of  developing  methods  of  technical  diagnostics.  Such  methods  are 
involved  in  many  areas  of  the  engineering  sciences  in  data  processing  and 
computer  technology.  Issues  of  inspection,  diagnostics,  and  reliability  are 
considered  during  the  development  of  the  architectures  of  computing  systems 
and  the  structural  principles  of  implementation  of  the  computer  devices,  in 
the  circuit  engineering,  and  in  the  development  of  the  computer  software. 

The  complicated  choice  of  a  method  of  inspection  and  diagnosis  of  a  particular 
system  results  both  from  the  broad  spectrum  of  existing  methods  of  implementa¬ 
tion  of  such  task  and  from  the  contradictory  demands  on  the  system  of 
inspection  and  diagnosis  of  each  particular  system  under  development.  A 
further  degree  of  complexity  in  the  handling  of  such  task  is  brought  in  by  the 
necessity  of  optimization  of  the  solution,  taking  into  account  the  high  cost 
of  the  diagnostics  in  the  hardware  and  other  expenses  and,  especially,  the 
impossibility  of  achieving  certain  methods  of  diagnostics  on  account  of  the 
existing  technical  constraints  and  demands  for  a  particular  product.  As  a 
result,  an  integrated  approach  is  usually  employed  to  handle  the  area  of 
inspection  and  diagnostics  of  the  hardware,  combining  elements  of  architec¬ 
tural,  structural,  circuit -engineering  and  software  reliability  and  technical 
diagnostics . 

One  of  the  ways  of  handling  the  question  of  detection  and  finding  of  faults  in 
digital  equipment  is  the  use  of  the  method  of  a  shadow  shift  register, 
implemented  (in  particular)  in  several  projects  of  the  IBM  company,  as  well  as 
in  the  development  of  the  vector  processor  (VP)  of  the  Elbrus -2  multiprocessor 
computing  complex  (E-2  MCC)  [1] .  In  the  VP  of  the  E-2  MCC,  this  method  is 
implemented  in  a  somewhat  modified  form. 

The  essence  of  the  shift  register  method  involves  the  combining  of  all 
functional  flip  flop  elements  of  a  digital  device  into  a  single  shift 
register,  granting  a  hardware  possibility  of  reading  and  writing  in  this 
register  by  means  of  a  shift  operation.  For  this,  it  is  possible  either  to 
use  microcircuits  with  an  auxiliary  shift  function  as  the  functional  shift 
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registers  or  to  employ  a  duplication  of  the  functional  flip  flop  elements  with 
copying  of  information  into  "shadow"  flip  flop  elements  not  used  in  normal 
functioning  of  the  equipment. 


Implementation  of  such  possibility  affords  read  and  write  access  to  any  given 
flip  flop  element  within  a  digital  device  (having  a  minimum  number  of  external 
connections)  and,  in  providing  a  hardware  interface,  makes  it  possible  to 
Utilize  the  obtained  information  in  handling  the  task  of  inspection  and 
diagnosis  in  a  certain  processing  device,  external  with  respect  to  the 
equipment  under  diagnosis. 

The  external  processing  device  (terminal  computer)  for  the  VP  of  the  E-2  MCC 
is  the  minicomputer  SM-1420  [2],  provided  with  a  special  interface  to  the  VP 
and  special  software.  The  assemblage  of  these  devices  is  known  as  the 
diagnostic  utility  system  of  the  VP  of  the  £-2  MCC  [3].  The  authors  have 
worked  out  the  architecture  and  organized  the  software  of  this  system.  The 
system  is  intended  for  use  in  the  debugging  phase,  and  also  for  inspection  and 
diagnostics  during  operation  of  the  VP. 


Development  of  the  ideology  and  hardware  implementation  of  the  shift  register 
method  in  the  equipment  of  the  VP  was  accomplished  by  the  development 
engineers  of  the  vector  processor  of  the  E-2  MCC  [1,  4].  Powerful  hardware 
support  of  the  tasks  handled  by  the  diagnostic  utility  system  is  accomplished, 
in  particular,  by  a  special  built-in  utility  device  of  the  VP,  which  besides 
providing  communications  with  the  SM-1420  terminal  computer  supports  all  kinds 
of  interaction  between  the  terminal  computer  and  the  VP  [4] . 


The  connection  of  the  diagnostic  utility  system  (DUS)  to  the  VP  is  shown 
schematically  in  Fig.  1.  Although  the  DUS  was  specially  developed  for  the  VP 
it  can  also  be  used  for  any  other  synchronous  digital  equipment  in  which  a 
hardware  interface  is  provided  to  the  DUS,  and  the  flip  flop  elements  in  the 
circuits  are  combined  into  a  single  (possibly  shadow)  shift  register. 


Key: 

a. 


b. 


Fig.  1.  Connection  of  the  DUS  to  the  VP. 
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The  structure  of  the  diagnostic  utility  system  is  shown  in  Fig.  2.  The 
firmware  interface  to  the  VP  enables  a  read  and  write  exchange  by  a  snapshot 
or  shift  chain  between  the  VP  and  the  working  storage  of  the  SM-1420  terminal 
computer,  loading  in  the  VP  and  starting  of  the  functional  inspection  test  of 
the  VP,  stopping  of  the  VP,  changing  of  the  operating  modes  of  the  VP, 
starting  the  VP,  and  also  performance  of  other  functions  stipulated  by  the 
instruction  system  of  the  interface  unit. 
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Fig.  2.  Structure  of  the  diagnostic  utility  system. 
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As  noted,  all  flip  flop  elements  are  combined  into  shift  registers  or  chains 
in  the  shift  register  method.  The  totality  of  all  chains  for  a  device  or  the 
processor  as  a  whole  is  known  as  a  snapshot.  It  is  necessary  to  point  out  in 
the  following  discussion  that  the  number  of  flip  flop  elements  (stages)  in  the 
chain  is  unknown.  The  sequence  of  chains  in  the  snapshot  is  also  unknown. 

Thus,  the  DUS  is  able  to  control  the  operation  of  the  VP  and  obtain  informa¬ 
tion  from  the  VP  in  the  form  of  snapshots.  On  the  other  hand,  the  DUS  is 
provided  with  a  link  to  the  archives  of  a  computer-aided  design  (CAD)  system, 
thus  obtaining  access  to  the  archive  descriptions  of  the  devices  of  the  VP, 
consisting  of  a  list  of  the  elements  and  a  list  of  the  interconnections  and 
names  of  the  signals,  the  names  of  the  cards  and  modules.  All  of  these 
attributes  are  represented  in  the  schematic  diagrams  of  the  product  and 
constitute  the  basis  for  an  understanding  of  the  product,  its  operation  and 
repair. 

Since  this  information  is  (of  course)  absent  from  the  electronic  circuits  and, 
thus,  from  the  snapshots  obtained  by  the  DUS  via  the  link  to  the  VP,  the  DUS 
should  be  able  to  perform  a  one-to-one  transformation  of  the  numbers  of  the 
flip  flops  from  the  shift  chain  into  the  terms  of  the  circuit  documentation 
and  vice  versa. 

In  order  to  handle  this  task,  the  DUS  is  provided  with  tables  of  names  of  the 
signals  for  each  device  of  the  VP,  which  is  kept  in  the  archive  of  tables  of 
correspondence  of  the  names  of  the  signals  of  the  VP  devices.  The  table 
establishes  a  correspondence  between  the  name  of  the  signal  in  the  diagram  of 
the  card  of  a  particular  device  and  the  number  of  the  bit  in  the  shift  chain. 

Using  the  name  of  the  signal  and  the  name  of  the  card  for  a  particular  device, 
the  table  makes  it  possible  to  find  the  number  of  the  bit  in  the  corresponding 
shift  chain  and,  on  the  contrary  (as  is  easily  seen  in  Fig.  3),  the  name  of 
the  signal  and  the  name  of  the  card  to  which  this  signal  belongs  is  found  from 
the  number  of  the  bit  of  the  shift  chain  of  a  particular  device.  The  organi¬ 
zation  of  the  tables  of  correspondence  of  the  signal  names  (CSN)  allows 
identical  signal  names  to  be  present  in  different  cards.  On  the  other  hand, 
replacement  of  the  signal  names  and  card  names  with  address  references 
(Fig.  3)  enables  a  substantial  reduction  in  volume  of  memory  required  to  store 
these  tables,  which  is  of  definite  value  when  the  DUS  is  implemented  in  a 
minicomputer . 

The  entry  of  the  CSN  tables  was  resolved  in  the  DUS  by  developing  a  special 
entry  language  and  editor  of  the  tables  of  correspondence  of  signal  names  of 
the  devices,  making  extensive  use  of  indexing  of  the  register  signals  and  the 
numbers  of  the  stages  of  the  shift  chain  when  entered  into  the  table  by  the 
user.  The  indexing  makes  it  possible  to  reduce  the  volume  of  work  of  user 
entry  of  the  CSN  table  by  several  times  (tens  of  times  for  register  signals). 

The  mechanism  of  the  CSN  tables  thus  allows  a  transformation  of  the  user 
directives,  expressed  in  terms  of  circuit  documentation,  into  the  language  of 
the  DUS-VP  interface,  i.e.,  in  terms  of  the  numbers  of  stages  of  the  shift 
chains . 
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The  main  tasks  handled  by  the  diagnostic  utility  system  from  the  standpoint  of 
the  user  of  the  DUS  are  as  follows: 
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-  indicfltion  of  tlis  static  condition  of  the  flip  flop  elements  of  the  VP, 

-  dynamic  indication  from  one  clock  period  to  another  with  output  of  the  VP 
operation  time  diagrams; 

-  modeling  of  individual  circuits  and  devices  of  the  VP  from  one  clock  period 
to  another  on  the  basis  of  an  archive  with  comparison  against  the  VP  operating 
diagrams  from  one  clock  period  to  another; 

-  diagnostics  and  localization  of  VP  faults. 

In  the  article,  we  shall  discuss  only  the  first  three  of  these  tasks.  The 
method  of  diagnostics  and  localization  of  faults  by  use  of  shift  registers  has 
no  direct  bearing  on  the  subject  of  the  present  work,  which  is  mainly  devoted 
to  the  architecture  and  organization  of  the  DUS .  We  merely  note  that  granting 
of  access  to  all  flip  flop  elements  of  the  digital  circuit  substantially 
facilitates  the  solving  of  the  problem  of  diagnostics  and  synthesis  of  test 
sequences.  In  such  approach,  it  is  possible  to  use  the  most  simple  and 
well-studied  methods  of  analysis  and  synthesis  of  tests  for  combination 
layouts.  However,  in  practice,  the  problem  is  somewhat  more  complicated, 
since  not  all  flip  flop  elements  have  been  incorporated  into  a  shift  chain  in 
the  VP  under  consideration.  Furthermore,  the  situation  is  greatly  complicated 
by  the  need  to  analyze  not  only  hard  failures ,  but  also  soft  failures  in  the 
functioning  of  the  equipment,  especially  in  the  case  when  the  hardware 
monitoring  sends  the  emergency  stop  signal  much  later  than  the  appearance  of 
the  error  in  operation  of  the  equipment. 

Indication  of  the  static  condition  of  the  flip  flop  elements  of  the  VP 
involves : 

-  formation,  editing,  and  output  to  the  terminal  and  printer  of  indication 
patterns ; 

-  output  of  the  values  of  the  shift  chain  or  snapshot  of  the  VP  by  the 
indication  patterns,  the  names  of  the  registers,  and  the  names  of  the 
individual  signals. 

The  indication  pattern  here  is  a  certain  product  of  the  display  console,  the 
format  of  which  will  be  used  subsequently  to  put  out  an  indication  from  the 
VP.  In  the  DUS,  the  screen  of  the  alphanumeric  display  of  the  terminal 
computer  is  used  as  the  display  screen  for  the  VP.  Any  desired  number  of 
indication  patterns  may  be  prepared  in  advance  and  stored  in  the  indication 
pattern  archive  of  the  DUS.  A  pattern  is  the  contents  of  one  terminal  screen, 
filled  in  by  the  indication  pattern  editor  in  a  special  mode.  In  the  places 
where  it  is  necessary  to  show  the  value  of  a  particular  register  or  individual 
signal  during  the  indication,  the  title  of  such  is  directly  entered.  During 
the  indication,  the  title  is  replaced  with  the  value  from  the  shift  chain  or 
snapshot.  Any  desired  alphanumeric  labels  or  notes  may  be  placed  in  the  field 
of  the  screen.  These  remain  unchanged  during  indication  and  make  it  possible 
to  create  legible  displays.  The  values  of  the  registers  can  be  put  out  in 
base  2,  8,  or  16  arithmetic  systems,  and  also  in  the  form  of  an  integer  and  a 
real  value . 


Dynamic  indication  from  one  clock  period  to  another  with  output  of  the 
operating  time  diagrams  of  the  VP  is  secured  by  the  organization  of  a  special 
mode  of  operation  of  the  VP  with  storage  of  the  sequence  of  snapshots  of  the 
VP  or  chains  of  a  specified  device  in  the  DUS  archive  during  each  clock 
period,  as  well  as  a  program  block  for  management  and  editing  of  the  signal 
lists  for  the  time  diagrams.  Different  versions  of  such  lists  may  be  written 
into  the  archive  of  signal  lists  for  the  diagrams. 

j 

...  •  i 

After  a  given  sequence  of  snapshots  of  the  operation  of  the  VP  is  written  into 
the  archive  of  snapshots  and  shift  chains,  this  sequence  may  be  used  on  many 
occasions  to  investigate  any  given  diagrams  by  any  given  user  of  the  DUS  over 
a  lengthy  period  of  time.  This  ensures  a  minimum  time  of  occupation  of  the 
VP,  which  is  important  in  the  case  of  debugging  or  repairing  of  the  VP, 
allowing  the  VP  to  be  used  from  several  DUS  terminals  at  the  same  time. 

The  sequences  of  snapshots  formed  in  the  DUS  archive  are  also  used  for 
comparison  against  the  results  of  the  modeling  of  individual  circuits  and 
devices  of  the  VP  during  each  clock  period  on  the  basis  of  their  archive 
description.  Adequate  functioning  of  the  model  and  equipment  depends  much  on 
the  choice  of  an  appropriate  method  of  modeling,  avoiding  possible 
discrepancies  between  the  operation  of  the  model  and  that  of  the  equipment. 

The  issue  of  adequacy  of  modeling  methods  is  discussed,  in  particular,  by  [5] . 

In  addition  to  the  above,  the  functions  of  the  DUS  include: 

-  placing  the  values  of  the  shift  chain  in  symbolic  names  and  editing  thereof; 

-  comparing  a  list  (chain)  obtained  from  hardware  against  a  snapshot  (chain) 
previously  stored  in  the  archive  of  snapshots  of  the  DUS ;  such  option  allows  a 
diagnosis  of  the  VP  in  terms  of  stored  sequences  of  snapshots  during  short 
tests;  in  event  of  discrepancy,  the  names  of  the  flip  flops  and  the  numbers  of 
the  clock  periods  at  the  time  of  the  discrepancy  are  put  out; 

-  checking  the  switching  of  the  indicated  group  of  flip  flops  in  a  given  test 
for  preliminary  analysis  of  completeness  of  the  test. 

The  database  of  the  diagnostic  utility  system  enables  storage  of  all  types  of 
information  used  with  explicit  indication  of  the  type  and  program  checking  for 
its  proper  use.  The  information  is  protected  against  unauthorized  access. 

The  software  of  the  DUS  was  developed  in  the  framework  of  the  operating  system 
RAFOS  SM-1420  (a  counterpart  of  the  system  RT-11  of  the  DEC  company)  [6] , 
although  in  the  standard  mode  the  DUS  runs  under  control  of  its  own  monitor, 
specifically  oriented  to  carry  out  the  group  of  DUS  programs.  In  addition  to 
the  main  requirement  of  control  of  the  DUS ,  the  monitor  was  developed  to 
ensure : 

-  minimization  of  the  memory  volume  of  the  resident  portion  of  the  monitor; 

-  substantially  increased  reliability  of  information  storage,  including 
protection  against  soft  failures  during  the  operation  of  the  file  system  (as 
compared  to  the  OS  RAFOS) ; 
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-  effective  operation  with  information  arrays  in  the  working  storage,  the  size 
of  which  is  greater  than  64  K  byte,  thanks  to  direct  use  of  additional 
registers . 

The  last  point  is  especially  important  in  assuring  sufficient  speed  of 
modeling  of  devices  of  the  necessary  extent. 

In  the  information  structure  of  its  archives ,  the  DUS  monitor  is  fully 
compatible  with  the  OS  RAFOS .  Interesting  in  their  own  right  are  the  screen¬ 
line  text  editor  with  dynamic  segmentation  of  the  file  being  edited  and 
protection  against  soft  failures,  as  well  as  a  debugger  program  for  operation 
with  assembler  programs,  which  were  developed  in  the  context  of  the  software 
of  the  DUS.  These  programs  can  be  used  by  themselves  within  the  OS  RAFOS. 

The  DUS  monitor  supports  simultaneous  operation  of  as  many  as  four  terminals, 
and  also  enables  execution  of  the  DUS  operators  in  interactive  and  batch 
modes . 
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SEVERAL  STRUCTURAL  AND  ALGORITHMIC  ASPECTS  OF  A  SCALAR-VECTOR  COMPUTATIONAL 
MODEL 

907G0089E  Moscow  SUPER-EVM  in  Russian,  pp.  77-96 
[Article  by  A.  S.  Olenin] 

[Text]  Specialization  is  one  of  the  most  important  ways  of  increasing  the 
speed  of  computations.  From  the  standpoint  of  data  processing,  computations 
are  divided  into  two  major  groups:  numerical  and  logical,  so  that  any  given 
algorithm  may  be  characterized  by  a  number  I  =  Qi/Qn*  where  and  Qj^  are  the 
number  of  numerical  and  logic  operations,  respectively,  in  it.  This  number 
serves  as  a  criterion  of  the  "intellectuality"  of  the  algorithm  and 
characterizes  to  some  degree  its  area  of  computation.  Thus,  when  I  «  1,  the 
computations  should  be  characterized  as  substantially  numerical,  and  when 
I  »  1  as  substantially  logical.  The  concepts  for  design  of  the  computational 
equipment  for  each  of  these  regions  may  differ  significantly  from  each  other , 
In  the  first  case,  for  example,  we  require  processors  oriented  to  numerical 
solution  of  scientific  and  engineering  problems,  and  in  the  second  case 
processors  which  handle  information  in  logical  space. 

The  early  stages  of  development  of  computer  equipment  involved  the  creation  of 
universal  computers  capable  of  covering  a  relatively  small  region  of  I  values 
on  either  side  of  unity  and  therefore  meeting  the  practical  needs  for  process¬ 
ing  of  numerical  and  logical  information  only  in  limited  extent.  The  present 
stage  of  computations  requires  ever  further  penetration  into  the  regions  of 
small  and  large  numbers  I,  and  we  are  witnessing  how  special  devices  or 
processors  meeting  these  needs  are  being  created  in  the  context  of  universal 
machines  or  in  self -standing  form. 

The  domain  of  extremely  low  I  is  of  major  importance  to  the  numerical  solution 
of  scientific  and  engineering  problems  which  demand  a  steady  increase  in 
computational  work.  The  increase  in  volume  of  computations  should  not  entail 
a  major  growth  in  the  volume  of  the  programs.  Hence  the  desire  to  structure 
the  data  in  such  way  that  their  dimension  does  not  control  the  length  of  the 
program  notation.  This  approach  corresponds  to  a  vector  or  matrix  organiza¬ 
tion  of  data,  which  in  many  cases  requires  a  special  formulation  of  the 
algorithm.  Actual  algorithms  generally  contain  both  vector  and  scalar  (i.e., 
nonvectorizable  or  poorly  vBctorizable)  sections  and  a  certain  amount  of 
combining  logic.  The  scalar  component  is  characterized  by  the  now  customary 
and  traditional  computations  in  the  region  of  I  numbers  much  closer  to  unity 
than  the  initial  algorithm  as  a  whole,  and  therefore  these  can  be  realized 
effectively  by  universal  means.  Thus,  one  of  the  ways  of  entering  the  domain 
of  supernumerical  computations  I  «  1  is  breaking  up  the  algorithm  into  scalar 
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and  vector  components  and  applying  a  universal  structure  to  the  first  of 
these,  and  a  specialized  vector  processor  based  on  pipeline,  matrix,  or  some 
other  organization  to  the  second.  The  concept  of  a  program  in  its  fundamental 
sense  is  retained,  being  supplemented  primarily  by  the  introduction  of 
linguistic  vector  constructs. 

The  process  of  computation  may  be  represented  by  a  sequence  of  steps  of  type: 

q  a^b, 

where  a,  b,  q  are  the  vectors  of  the  operands,  9  is  the  vector  of  operation 
signs  -f,  X,  etc. 

We  may  distinguish  several  cases  of  interpretation  of  this  expression, 
comparing  them  in  simplified  fashion  with  the  types  of  computations. 

1.  Scalar  computations  may  be  defined  by  a  sequence  of  steps: 

which  differs  from  the  above  in  that  the  operands  and  the  sign  of  the 
operation  in  each  step  are  scalars. 

2.  Vector  computations  involve  a  representation  of  the  algorithm  by  a 
collection  of  steps  of  type: 

where  the  operands  are  vectors,  while  the  signs  of  the  operations  are  scalars. 

3.  Parallel  computations  directly  correspond  to  the  notation: 

Each  of  these  expressions  can  be  matched  up  with  its  own  type  of  computational 
structure  in  the  familiar  classification:  the  first  with  SISD,  the  second 
with  SIMD,  the  third  with  MIMD.  Hereafter  we  shall  be  concerned  only  with  the 
first  two  types  of  computation,  assuming  that  if  the  vectors  of  the  signs  of 
the  operations  in  the  latter  expression  contain  a  large  number  of  identical 
elements  then  it  is  advisable  to  disjoin  the  parallel  computations  into  more 
simple  blocks  of  vector  type,  having  separate  control  for  their  own  scalars  of 
the  operations. 

The  most  simple  model  of  scalar-vector  computations  may  be  described,  e . g. ,  by 
the  sequence  of  steps: 

where  f (a' ,  b')  is  a  scalar  function  of  vectors  which  specifies  the  scalar 
parameter  for  the  following  vector  process.  This  model  admits  an  analogy  with 
the  operational  portion  of  the  axioraatics  of  a  vector  space  over  a  field  of 
real  numbers,  granting  the  possibility  of  an  exchange  of  information  between 
the  space  and  the  field.  This  immediately  delineates  the  contours  of  a 
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scalar-vector  calculator  which  assumes  the  presence  of  two  interacting 
processors,  one  of  which  processes  the  field,  the  other  the  space.  We  shall 
call  such  structural  formation  a  scalar-vector  complex  (SVC). 

If  during  each  step  of  the  model  it  is  possible  to  combine  in  time  the 
performance  of  a  vector  operation  and  the  calculation  of  a  scalar  parameter  to 
be  used  in  the  next  vector  operation,  we  shall  call  such  process  a  scalar- 
vector  process  with  disjunction.  The  practical  value  of  a  scalar-vector  model 
with  disjunction  is  that  the  scalar  component  in  this  case  may  be  performed  in 
the  background  of  the  vector  computation,  in  no  way  hindering  it.  In  condi¬ 
tions  of  disjunction,  the  speed  of  the  computation  depends  on  the  speed  of  the 
vector  processor,  and  the  complex  problems  of  accelerating  the  scalar 
component  when  it  becomes  a  "bottleneck”  may  be  simply  unimportant.  Of 
course,  this  requires  that  the  vector  component  surpass  the  scalar  so  greatly 
in  volume  of  computations  that  even  the  use  of  d  vector  processor  would  not 
result  in  a  computational  time  advantage  for  the  scalar  component. 

In  the  actual  situation,  scalar-vector  computations  should  be  understood  in  a 
much  broader  sense  than  that  in  the  elementary  model.  This  is  especially  true 
of  the  function  of  a  scalar  processor  (SP) ,  which  is  entrusted  with  the 
support  of  the  operation  of  the  vector  processor  (VP),  in  addition  to  the 
immediate  computations.  The  disjunction  of  an  algorithm  into  scalar  and 
vector  parts  should  be  done  for  reasons  of  effectiveness  and  balance  in  the 
operation  of  the  complex.  The  scalar  component  may  keep  the  unutilized 
options  of  vectorization,  which  will  be  implemented  in  event  of  overloading  of 
the  vector  processor. 

If  the  range  of  user  problems  contains  a  vector  component  of  sizable  volume 
and  if  the  scalar  computation  is  not  the  "bottleneck"  of  the  computations,  the 
speed  of  the  complex  (as  already  noted)  will  depend  on  the  speed  of  the  VP, 
and  the  highest  priority  of  the  scalar  processor  should  be  to  ensure  an 
uninterrupted,  fully  loaded  operation  of  the  vector  component.  The  operating 
time  of  the  SP  not  devoted  to  this  function  may  be  used  in  multiprogram  mode 
for  other  less  important  tasks.  Such  mode  enables  high  speed  both  in  an 
individual  task  and  in  a  mixture  of  tasks:  in  the  first  case,  thanks  to  the 
VP,  in  the  second,  thanks  to  the  SP. 

The  design  principles  for  the  Soviet  complex  for  scalar-vector  computations 
were  advanced  by  V.  S,  Burtsev  in  the  course  of  the  development  of  the 
Elbrus -2  MCC  [1]  and  the  vector  processor  [2].  Implementation  of  these 
principles  required  a  solution  to  a  number  of  special  hardware  and  system 
problems.  At  the  same  time,  in  the  context  of  the  topic  of  mutual  representa¬ 
tion  of  computing  structures  and  algorithms,  studies  in  the  organization  of 
algorithms  to  ensure  effective  functioning  of  SVC  are  becoming  important. 
Summarizing  the  above,  the  following  are  the  main  requirements  on  the 
properties  of  the  algorithms : 

1.  A  vector  quality.  Vectorization  of  the  algorithms  should  strive  to  make 
the  volume  of  vector  calculations  in  them  much  greater  than  the  volume  of  the 
scalar  part. 

2.  Loading  properties  of  the  algorithms.  The  vector  component  should  ensure 
(where  possible)  full  workload  of  the  functional  devices  of  the  VP  in  the 
course  of  its  processing,  A  full  workload  should  be  taken  to  be  a  processing 
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of  vectors  with  constant  and  not  overly  small  size  in  sufficiently  large 
computational  segments . 

3.  Disjunction  of  computations  into  scalar  and  vector  components.  By 
disjunction  (as  we  have  seen)  is  understood  not  merely  a  breakup  into  scalar 
and  vector  components,  but  a  construction  (if  possible)  of  such  combination 
thereof  that  the  scalar  element  can  be  executed  in  the  background  of  the 
vector  computation,  in  no  way  hindering  it. 

The  special  nature  of  the  above  qualities  involves  the  fact  that  their 
realization  occurs  on  the  algorithmic  level ,  regardless  of  the  particular 
structure  of  the  SP  and  VP.  We  present  below  a  brief  discussion  of  several 
algorithms  for  solving  systems  of  linear  algebraic  equations  (SLA.E)  by  using 
the  aforesaid  properties  in  an  application  to  a  model  of  the  SVC.  Discussion 
of  the  detailed  estimates,  comparisons,  and  applications  to  specific  architec¬ 
tures  exceeds  the  ambition  of  the  present  work.  General  approaches  to  the 
parallel  running  of  a  number  of  special  algorithms  of  mathematical  physics 
will  be  found,  e.g.,  in  [4,  5]. 

1.  Scalar-Vector  Algorithmics  of  Dense  Matrices 

We  shall  assume  that  the  matrix  A  in  the  system  Ax  =  f  is  dense,  real-valued, 
nondegenerate,  and  we  consider  a  model  of  SVC  that  realizes  the  following 
construction  in  each  "n"-th  step  of  the  computations: 


(1) 

(2) 


where  a^,  bj^  are  vectors,  is  a  scalar  function  of  vectors.  As  the 
discussion  proceeds,  the  model  will  be  supplied  with  the  necessary  supplements 
where  necessary. 

The  scheme  (1),  (2)  is  a  scalar-vector  process  with  disjunction  on  the  basis 
of  the  vector  process  (2) .  The  vector  process  (2)  will  be  termed  complete  if 
the  dimensions  of  the  vectors  being  processed  are  kept  unchanged  from  one  step 
to  another.  In  this  case,  a  complete  workload  of  the  functional  devices  of 
the  VP  can  be  automatically  assured  and  there  is  no  need  to  monitor  the  length 
of  the  vectors  during  the  computation.  We  define  the  scalar  as  the 
parameter  of  the  vector  process  (2)  in  the  "n**-th  step.  Processes  with  vector 
parameters  that  will  be  encountered  further  on  call  for  an  element  by  element 
vector  operation  of  multiplication,  instead  of  a  scalar  one. 

Let  us  give  some  characteristic  examples .  Say  it  is  required  to  evaluate  the 
expression: 

where  x,  f  are  vectors,  A  is  a  matrix  of  order  N  with  columns  a^.  This 
expression  may  be  written  naturally  in  the  form  of  a  vector  process: 
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The  role  of  the  parameters  of  the  process  is  played  here  by  the  components  of 
the  vector  x.  If  the  matrix  A  is  dense,  the  vector  process  is  complete.  But 
in  the  case  of  a  triangular  matrix,  for  example,  the  completeness  of  the 
process  is  lost,  since  in  order  to  avoid  operations  on  the  zero  elements  it  is 
necessary  to  change  the  length  of  the  vector  from  one  step  to  another  and  to 
deal  with  short  vectors  at  a  particular  stage.  This,  of  course ,  entails 
further  expenses  on  organization  of  the  computations  and  has  a  negative  impact 
on  the  operating  efficiency  of  the  processor.  Vector  processes  that  are 
formed  on  the  basis  of  triangular  matrices,  i.e.,  such  as  have  a  linear  change 
in  the  length  of  the  vector  from  one  step  to  another,  will  be  termed 
triangular.  We  note  that  the  method  of  doubling,  e.g.,  for  the  summation  of 
numbers,  results  in  a  logarithmic  change  in  the  length  of  the  vector  as  a 
function  of  the  number  of  steps  and  can  be  written  as  a  logarithmic  vector 
process. 

From  the  standpoint  of  complete  workload  of  the  VP,  complete  vector  processes 
with  length  of  vector  not  excessively  short  have  the  most  efficient  mapping 
onto  the  structure  of  the  vector  processor,  and  therefore  we  shall  devote  the 
main  attention  to  them  hereafter. 

It  should  be  noted  that  the  construction  of  complete  vector  processes  may 
involve  not  only  a  lowering  of  the  height  of  the  algorithm,  but  also  a  certain 
increase  in  its  length- -the  total  number  of  operations.  Therefore,  the 
advisability  of  using  complete  algorithms  depends  on  how  much  more 
advantageous  in  terms  of  computational  time  are  their  more  pronounced  vector 
properties  than  the  short  length  of  their  less  vectorial  counterparts.  With 
increasing  parallel  processing  resources,  the  vectorial  qualities  of  complete 
algorithms  become  predominant,  and  when  the  resources  are  sufficiently  large 
their  use  is  the  most  efficient. 


1.1,  Direct  Methods 

The  Jordan  algorithm  is  naturally  written  as  a  complete  vector  process 


where 


w 


0.4  ^  km  ^  J-  ^ 


(3) 

(4) 


^r*  ^  \  ^****^  %  •••  f  ^  • 
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The  method  of  Gauss  elimination  is  also  characterized  by  a  vector  form,  but 
the  calculation  now  involves  two  stages  in  succession.  The  forward  path 
occurs  by  a  scheme  analogous  to  (3),  (4),  where  the  vectors  are  defined 
somewhat  differently: 


and 


d  ^Jk 


(5) 

(6) 


«  ^0,0, * 
aL*"*  (o.o....,0.  i. 


('o.o,  ....  0,  O,  .M,  /  • 


The  bar  here  and  afterwards  indicates  abbreviated  vectors ,  which  are  formed 
from  the  complete  ones,  as  is  obvious,  by  replacing  the  portion  of  the 
elements  above  or  below  the  diagonal  with  zeros .  The  dot  above  a  vector 
indicates  that  the  k-th  row  of  the  matrix  in  the  k-th  step  is  replaced 

by  another  row  (in  the  present  case,  the  k-th  row  of  the  unit  matrix). 

After  (5),  (6)  is  executed,  reverse  substitution  is  accomplished  by  a 
triangular  vector  process.  Processes  (5),  (6)  are  also  triangular  in  the 
index  k. 

The  Jordan  vector  scheme  as  compared  to  the  Gauss  has  lower  height  and  less 
completeness  of  the  process,  but  is  -I. 5  times  greater  in  length.  Therefore, 
from  the  standpoint  of  traditional  consecutive  computations,  the  method  of 
elimination  may  be  regarded  as  a  certain  optimization  of  the  Jordan  scheme  for 
the  case  of  poorly  parallel  structures,  in  particular,  a  single  processor 
structure,  when  the  time  of  execution  of  the  algorithm  depends  primarily  on 
its  length.  The  decrease  in  length  by  switching  to  the  Gauss  method  is 
achieved  at  the  cost  of  a  certain  loss  in  vector  properties  of  the  Jordan 
scheme  by  partial  scalarization,  forming  two  consecutively  executed  segments: 
a  forward  path  and  a  backward  path.  Similar  methods  of  optimization  are  the 
method  of  opt imal  e 1 imina t ion ,  the  compac t  s cheme  o f  the  Gaus s  me thod ,  and 
other  techniques  that  are  effective  in  the  sense  of  a  consecutive  realization 
of  the  SLAE  solution  algorithm. 

The  decomposition  of  a  dense  matrix  into  a  Frobenius  matrix  product  results  in 
a  complete  vector  process: 
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where 
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Applying  this  decomposition  to  the  transposed  system  xV  -  f^,  we  may  see 
that  it  corresponds  to  the  Jordan  scheme.  It  is  easily  observed  that  the 
Jordan  scheme  (3),  (4)  is  a  disjoining  scalar-vector  process,  since  the 
parameters  may  be  computed  for  different  j  in  the  background  of  execution 

of  the  vector  process. 


We  may  observe  that  introducing  a  procedure  of  searching  for  the  leading 
elements  into  algorithms  of  the  elimination  type  disturbs  the  regularity  of 
the  computation  and,  thus,  reduces  the  effectiveness  of  the  vector  computa¬ 
tions.  From  this  standpoint,  and  for  reasons  of  precision,  the  methods  based 
on  orthogonal  transformations  are  preferable,  even  though  they  require  a 
larger  number  of  operations.  Let  us  consider  the  vector  qualities  of  two  such 
methods:  the  method  of  reflections  and  the  method  of  rotations. 


In  the  method  of  reflections,  the  initial  system  is  transformed  into  a 
triangular  one  by  means  of  an  orthogonal  reflection  matrix  [3]: 


If  the  vector  w  is  chosen  by  formulas  w  =  a-ae,  a  =  J(a,  a) ,  then  the  given 
vector  a  will  become  a  collinear  unit  vector  e  by  means  of  the  transformation 
Ua  =  ae.  As  applied  to  the  solution  of  SLAB,  the  method  is  written  thus: 

,  a*  tSi,  -  Qm 


Designating:  ^  ror*"*.’ 


we  have 

a,  -  Oj.  -JTk 
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where  we  obtain  for  the  parameter: 


d' 


Finally,  we  arrive  at  the  following  algorithm,  transforming  the  original 
system  into  a  triangular  one: 


■  q hk  -  <5te ,  , 


As  we  see,  the  only  difference  between  this  scheme  and  the  Gauss  scheme  is  the 
method  of  computation  of  the  parameters.  The  process  is  triangular  and  the 
height  of  its  vector  component  evidently  coincides  with  the  Gauss  scheme .  In 
order  to  convert  to  a  complete  process  with  Jordan  height,  we  convert  the 
abbreviated  vectors  to  complete  ones  or  combine  the  computation  of  the  portion 
of  the  matrix  below  the  diagonal  with  a  Jordan  elimination  in  the  triangular 
system.  In  the  first  case,  the  parameter  is  one  and  the  same  over  the  entire 
length  of  the  vector,  while  in  the  second  case  there  are  two  quantities  acting 
as  the  parameters ,  being  computed  in  terms  of  the  scalar  products  of  the 
abbreviated  vectors . 

The  method  of  rotations  is  accomplished  by  means  of  elementary  rotation 
matrices.  Say  we  have  a  matrix  in  which  the  elements  below  the 

diagonal  in  columns  1  through  k-1  are  equal  to  zero.  We  introduce  the 
notation: 


The  matrix 


T^xVT:.kXM*...ii.KW 


will  contain  zero  elements  below  the  diagonal  in 


the  k-th  column  as  well, 


provided  that: 


*  •  0 
C;w  -  S;w  • 
&iw  GiW  * 


0  ' 
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and  the  elements  of  the  matrices  are  found  by  formulas: 


Performing  the  multiplication  of  the  elementary  matrices  belonging  to  the  k-th 
column,  it  is  easily  seen  that,  using  the  equality 

the  conversion  of  the  elements  during  the  elimination  is  done  by  the  formula: 


Omitting  the  coefficient  in  front  of  the  parentheses,  on  which  the  result  of 
the  solving  of  the  system  does  not  depend,  but  the  orthogonality  of  the 
transformation  does,  we  arrive  at  the  expression: 


0^^ 

^‘4 


<s<{ 


^  tu-o 
OLCk 


tW 


Since  <  1,  this  expression  may  correspond  to  lower  values  of  the  inter¬ 
mediate  elements  than  the  previous  one.  As  a  result,  the  following  triangular 
process  is  created: 


replacing  the  abbreviated 
vectors  with  complete  ones  or  (as  above)  by  combining  this  scheme  with  a 
Jordan  elimination  for  a  triangular  system.  Unlike  the  schemes  given  above, 
the  parameter  here  is  a  vector. 

We  present  a  general  scalar-vector  elimination  scheme  for  solution  of  SLAE. 
For  this,  we  write  the  system  Ax  =  f  in  the  form: 


Y*  m  . 

We  may  arrive  at  a  complete  vector  process 
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We  shall  eliminate  the  unknowns  by  a  scalar  multiplication  of  the  left  and 
right  sides  with  vectors  that  are  orthogonal  to  the  respective  columns.  The 
vector  z  is  defined  by  means  of  relations : 


a,,  .s-  aO:, 


c  • 


where  s ,  t  are  arbitrarily  specified  nonzero  vectors ,  c  is  a  constant  chosen 
from  the  condition  (z,  ak)  =0.  If  the  unknown  Xj,  is  to  be  eliminated  from 
M(M  <  N)  equations,  we  must  accordingly  construct  the  set  of  vectors  ^ 
i  -  1,  M,  satisfying  conditions  (z^,  ak)  =  0.  Hereafter,  it  is  assumed  that 

Si  =  e^,  and  thus  we  have: 


a;- 


- 


(2.-.  ap  -  - 


cut. 

(-L.Ok') 


Using  this  formula  to  solve  a  SLAE  produces  the  following  complete  scalar 
vector  algorithm: 


(7) 


(8) 


A  similar  scheme  occurs  for  transformation  of  the  original  system  into  a 
triangular  one: 


(9) 


‘j  *■  » 


(10) 


Let  us  consider  several  cases  of  specification  of  the  vector  tj^. 


1-  “  ®k-  parameters  in  this  case  are  given  by  relations: 


i.e.,  algorithm  (7),  (8)  passes  into  the  Jordan  method,  while  algorithm  (9), 
(10)  into  the  forward  path  of  the  Gauss  method. 
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2.  tj^  =  (0»  •••>  0,  1,  0,  0)^,  where  the  ones  are  located  at 
positions  k  through  i.  Computation  of  the  parameters  comes  down  to  the 
following  ratio: 


^  ■  -tw  •'Sw  •(<>**-^0,tSww,awlu,^,,ou^ 

where 

aCw'-Sv,  , 


In  this  case  we  have: 


(aw,^ 

-  <3-u^w 


O^.  > 


^ i 

CSi 


and  an  algorithm  with  such  parameter  is  equivalent  to  the  above -considered 
scheme  for  the  method  of  reflections . 


We  obtain  an  expression  for  the  parameter: 


If  the  index  i  in  this  runs  through  values  from  k  to  N  in  each  vector  step, 
the  parameters  will  become  vectors  and  the  algorithm  will  concur  with  the 
scheme  obtained  above  for  the  method  of  reflections. 

5.  ti,  = 


Here  we  have  a  vector  process  using  all  elements  of  the  abbreviated  elimina 
tion  column: 


?  sr*’;  • 

•  fh^s  -  — ' 
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We  note  that  a  scalar-vector  notation  is  also  natural  for  the  method  of 
orthogonalization.  For  example,  the  column  version  of  this  method  can  be 
written  as : 


u.tj.  uirjTi 


in  the  forward  path,  and: 


^*1 


(at,,  j;'' 


fi.,  -  Pt  -  , 


in  the  backward  path.  As  we  see,  both  stages  represent  a  complete  scalar- 
vector  process. 

1.2.  Iteration  Methods 

While  realizing  that  the  multitude  of  iteration  algorithms  is  vast  [6],  we 
shall  touch  on  a  few  of  them  in  order  to  explain  the  characteristic  features 
of  the  scalar -vector  model.  We  shall  employ  the  following  canonical  form  [7] 
of  the  iteration  methods: 


5 


The  matrix  B  and  the  iteration  parameters  are  selected  from  conditions  of 
best  t3rpe  of  convergence  and  minimum  total  volume  of  computations.  The 
specific  quality  of  a  given  method  depends  on  the  specified  structure  of  the 
matrix  B  and  the  method  of  computing  the  parameters. 

In  the  explicit  case  B  =  E,  this  expression  may  be  rewritten  as  two  relations: 


T*"'.  P-  Ax*'-] 


from  which  it  is  obvious  that  the  explicit  iteration  scheme  is  naturally 
represented  by  a  pair  of  complete  vector  processes.  The  set  of  parameters  for 
this  may  be  found  from  a  priori  information  as  to  the  spectral  properties  of 
the  matrix  A.  In  methods  of  variation  type  and  in  schemes  of  conjugated 
directions  formulated  in  vector  form,  the  parameters  are  computed  by  using 
current  information  obtained  during  the  iteration. 

The  addition  of  an  implicit  quality  in  the  method  of  iterations  makes  it 
possible  to  shorten  the  number  of  iterations  as  compared  to  the  explicit 
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scheme,  although  in  certain  cases  the  volume  of  computations  in  each  iteration 
is  increased.  A  triangular  matrix  or  a  product  of  triangular  matrices  is 
often  used  as  the  matrix  B.  From  the  standpoint  of  vector  properties,  the 
change  to  triangular  matrices  (as  already  mentioned)  results  in  insufficiently 
effective  triangular  processes.  However,  by  special  organization  of  the 
computations,  the  possibility  exists  of  constructing  complete  vector  processes 
in  this  case  as  well. 


Let  us  consider  several  triangular  methods.  We  shall  assume  that  the  matrix  A 
has  been  reduced  to  the  form  A  ==  L  +  E  +  U,  where  L,  U  are  the  lower  and  upper 
triangular  matrices,  respectively,  with  zero  major  diagonals,  and  E  is  a  unit 
matrix. 


Consider  the  iteration  scheme: 


(  E  *  tL)  ^  4 


(11) 


which  during  each  iteration  requires  inversion  of  a  triangular  matrix.  The 
Gauss-Seidel  method  is  obtained  when  r  -  1.  In  this  case,  we  have  the 
expression: 


which  we  shall  write  as  two  relations: 


We  now  consider  the  following  vector  procedure  for  calculation  of  the  n-th 
iteration: 


...  . 


<^^r\ 


It  is  easy  to  verify  that  this  is  equivalent  to  the  previous  relations  by 
assuming  that: 


Xi. 


and  setting  *  fj^,  =  0  in  the  k-th  row  prior  to  evaluating  the  vector 

y<k)^ 

Thus,  a  complete  scalar-vector  process  for  the  Gauss-Seidel  method  will  be  as 
follows : 


91 


(12) 


where 


j'-V.  jC'i..  k.C7, 

cu  ■  (Otk,  ..^  Ok^,  0,  Oka. o^ik')^ 

^  u“^’’ 


The  parameters  here  are  the  components  of  the  unknovms,  obtained  as  the 
computations  proceed.  The  described  construction  leaves  the  length  of  the 
algorithm  almost  unchanged,  since  the  complete  process  is  formed  by  combina¬ 
tion  of  two  triangular  matrices  into  a  single  complete  one.  Triangular 
computations  remain  only  in  the  calculation  of  the  initial  approximation. 

The  method  of  consecutive  upper  relaxation  corresponds  directly  to  scheme 
(11),  where  r  is  the  parameter  of  relaxation.  Hence,  we  have  the  expression: 

or 

.  -P-  T7x*" 

x“'-- 


By  analogy  with  the  algorithm  (12),  the  latter  relations  are  equivalent  to  a 
complete  vector  process: 


aV,  (l-xT)  _ 


(13) 


where  (as  above); 


We  note  that  the  difference  between  algorithms  (12)  and  (13)  is  concentrated 
in  the  definition  of  the  parameters,  while  their  vector  component  is 
identical . 
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Let  us  consider  the  alternately  triangular  method  in  the  form: 


Here,  during  each  iteration  it  is  necessary  to  invert  a  triangular  matrix 
twice.  We  introduce  the  notation: 


Using  this,  we  will  have: 


V9e‘-> 

X*'-’.  a**'- Lse***' 


The  latter  two  relations  (as  previously)  may  be. combined  into  a  complete 
vector  process,  where 

and  inversion  of  a  triangular  matrix  is  only  necessary  to  find  the  vectors 
z(k)^  The  starting  approximation  is  computed  by  formulas: 

•?.  z-'.  xf*'- 

and  also  admits  of  a  combination  into  a  complete  process, 

2.  Features  in  the  Organization  of  the  Computations 

As  follows  from  the  above  analysis,  the  characteristic  feature  of  schemes 
represented  as  complete  vector  processes , is  the  fact  that  they  consist  of  two 
interrelated  parts:  scalar  and  vector.  The  properties  of  the  algorithms  are 
determined  by  the  scalar  part,  which  specifies  the  method  of  computing  the 
parameters  for  the  vector  processes.  But  the  vector  part,  which  requires  the 
largest  volume  of  computations,  remains  almost  unchanged.  Therefore,  it  is 
perfectly  natural  to  employ  here  the  model  of  the  SVC  as  a  structurally 
similar  model.  The  vector  process  is  used  especially  to  accomplish  the  vector 
part  of  the  algorithms.  The  parameters  required  for  its  operation  are 
supplied  by  the  scalar  processor,  the  operation  of  which  takes  place  in  the 
background  of  the  vector  computations.  It  is  clear  from  the  formulas  for  the 
generalized  elimination  scheme  that  the  input  data  for  the  vector  part  are  the 
parameter  of  the  process  and  the  corresponding  substituted  row.  If  the 
functioning  of  the  complex  is  coordinated  so  that  these  data  are  available  to 
the  vector  processor  at  the  start  of  each  vector  step,  the  time  for  the 
computations  depends  entirely  on  the  speed  of  the  latter. 

In  the  case  of  orthogonalization,  the  scalar  and  the  vector  parts  are 
comparable  in  number  of  operations,  and  therefore  one  should  consider  the 
prospect  of  vectorization  of  the  evaluation  of  the  parameters  themselves,  in 
order  for  the  use  of  the  described  complex  to  remain  efficient.  This  is  also 
true  of  other  methods  containing  an  excessively  laborious  determination  of 
parameters . 
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For  reasons  of  economy,  In  ttie  elimination  schemes  it  is  better  to  employ  not 
complete,  but  partial  scalar  products  in  calculating  the  parameters. 

Therefore,  in  order  to  achieve  coordination,  it  is  sufficient  to  compute 
during  each  step  only  as  many  terms  in  the  scalar  products  as  can  be  handled 
by  the  scalar  processor  up  to  the  time  when  the  vector  processor  will  require 
the  corresponding  parameter.  In  particular,  the  number  of  terms  may  be  set  in 
advance  by  specifying  a  certain  number  of  nonzero  elements  of  a  smoothing 
vector  t^.  With  such  organization  of  the  computations,  it  is  possible  for  the 
scalar  processor  to  generate  at  the  outset  a  series  of  parameters  which  differ 
in  the  number  of  terms  in  the  scalar  products  and  to  select  the  one  with  the 
minimum  modulus .  This  allows  automatic  prevention  of  situations  with  loss  of 
precision,  without  resorting  to  a  special  search  for  the  leading  element.  It 
is  important  to  emphasize  that  these  operations  concern  only  the  scalar  part^ 
of  the  algorithm,  and  therefore  the  operating  mode  of  the  complex  with  dynamic 
supply  of  parameters,  while  retaining  high  speed,  is  capable  of  providing 
computations  of  good  precision. 
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[Text]  Introduction 

The  transformation  of  time-varying  signals  into  the  frequency  domain  — 
spectral  analysis  —  is  a  powerful  tool  for  studying  a  wide  variety  of 
physical  entities:  electronic  devices,  communication  channels,  acoustic  and 
seismic  waves,  water  waves,  hydrodynamic  turbulence,  the  vibrations  of 
machines,  etc. 

In  the  sonic  frequency  range  (up  to  20-40  kHz) ,  the  traditional  analog 
methods  of  spectral  analysis  have  today  been  completely  replaced  by  digital 
methods,  such  as  recursive  digital  filtering  and  methods  which  make  use  of 
various  versions  of  fast  Fourier  transforms  [FFT] .  These  methods  are 
completely  indispensable  in  the  spectral  analysis  of  infralow  frequencies, 
and  they  surpass  the  digital-analog  method  of  temporal  compression  of  a 
signal  in  terms  of  analysis  accuracy. 

Recording  the  power  spectrum  is  not  always  the  ultimate  goal:  subsequent 
processing  and  analysis  (sometimes  extremely  complicated)  of  the  resulting 
spectrograms  are  frequently  necessary.  Recursive  digital  filtering  can  be 
used  effectively  only  for  a  logarithmic  frequency  scale;  this  circumstance  is 
a  serious  burden  for  subsequent  processing,  especially  in  the  case  of 
multichannel  analysis.  We  will  accordingly  restrict  the  discussion  below  to 
devices  which  operate  on  the  basis  of  FFT  algorithms.  We  can  distinguish 
four  approaches  in  the  implementation  of  FFT  analysis: 

1.  The  conventional  device  approach.  Because  of  the  multitude  of  analysis 
parameters,  the  corresponding  instruments  —  spectrum  analyzers  —  are 
becoming  progressively  more  complex  to  control,  and  large  and  specialized 
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control  panels  with  tens  of  different  buttons  and  knobs  are  required.  The 
latter  frequently  operate  under  the  control  of  a  separate  microprocessor. 
Apparatus  of  this  type  is  beihg  produced  by  many  manufacturers:  Bruel  & 

Kjaer  (Denmark),  Hewlett-Packard  (US)  and  Ono  Sokki  (Japan).  A  dedicated  FFT 
processor  is  built  into  these  instruments  to  increase  their  speed.  It 
increases  the  signal  band  which  can  be  processed  in  real  time  to  2-5  kHz.  In 
general,  all  the  processors  of  these  instruments  are  narrowly  specialized  and 
do  not  allow  the  user  to  add  .hew  functions.  They  are  thus  not  very  effective 
in  scientific  research. 

2.  Stand-alone  data  acquisition  systems.  In  this  approach  the  signal  is  first 
recorded  by  a  data  acquisition  system,  which  is  a  separate  unit  usually 
containing  a  high-quality  multichannel  analog— to— digital  converter  (ADC)  and 
a  digital  memory  of  40—100  Kwords,  equipped  with  an  MEK— 625  interface.  In  a 
system  of  this  type  the  data  processing,  including  the  spectral  analysis,  is 
carried  out  on  a  personal  computer,  which  is  also  connected  to  the  system. 

The  computational  capabilities  and  display  facilities  of  the  personal 
computer  are  utilized.  Some  pypical  instruments  are  the  Data  Laboratories 
DL-1200  and  DL-1208  (Britain),  the  Solartron  1250  (Britain)  and  the 
Hewlett-Packard  3852  system  (US) .  A  positive  feature  of  these  systems  is  that 
they  are  open  to  the  user. 

3.  Instruments  based  on  general-purpose  microprocessors.  In  this  approach,  a 
microcomputer  based  on  a  fast  general-purpose  microprocessor  (usually  the 
MC68000  or  the  8086)  is  built  directly  into  the  apparatus  which  contains  the 
data  acquisition  system  and  the  display  devices.  An  instrument  of  this  sort 
is  controlled  by  a  small  keyboard,  and  the  functions  of  the  various  keys  can 
usually  be  synthesized  on  the  monitor  screen  and  can  be  changed  as  the 
apparatus  is  adjusted  to  the  desired  type  of  operation.  As  an  example  we 
could  cite  the  Analogic  DATA-6000  (Data  Precision,  US),  in  which  the  basic 
operations  are  performed  by  software  on  an  MC68000  microprocessor.  More 
elaborate  processing  can  be  programmed  by  the  user  in  a  language  of  the  BASIC 
type  with  a  calculator  keyboard.  .  Systems  of  this  type  are  obviously  open  to 
the  user,  although  the  maximum  frequency  of  the  data  which  can  be  processed 
in  real  time  is  rather  low,  200^400  Hz. 

4.  Measurement  and  computation  complexes.  The  tasks  involved  in  the  spectral 
analysis  of  signals  can  also  be  performed -by  open  complexes  based  on 
bus-module  CAMAC  systems  under  the  control  of  a  minicomputer,  with 
appropriate  software  and  display  facilities.  Such  systems  offer  formidable 
analysis  capabilities  but  may  be  unacceptably  large.  Examples  are  the  SK4-71 
(USSR)  and  the  Intertechnique  IN-100  (France)  analyzers. 

In  this  connection,  there  is  practical  interest  in  implementing  signal 
analyzers  based  on  the  widely  available  open— architecture  microcomputers 
(with  a  bus-module  organization),  e.g.,  those  of  the  Elektronika  60  family. 
Ideally,  no  structural  changes  would  be  necessary,  and  there  would  be  a 


96 


minimum  number  of  peripheral  devices.  Our  purposes  in  the  present  study  were 
to  develop  the  missing  hardware  components  (an  input  unit  and  a  display 
controller) ,  to  develop  appropriate  software  for  processing  and  control,  and 
to  study  the  capabilities  of  a  single-channel  version  of  the  apparatus. 


1.  Hardware 

The  architecture  of  the  analyzer  has  the  classic  bus-module  structure  and  is 
shown  in  fig.  1.  The  basic  assembly  of  the  microcomputer  (the  central 
processing  unit,  the  memory  and  the  terminal  and  printer  interfaces)  is 
supplemented  with  a  system  for  entering  analog  data  and  a  graphics 
controller.  We  should  point  out  that  the  memory  of  the  computer  must  be 
static,  so  that  refreshing  by  the  central  processing  unit  does  not  interrupt 
the  data  stream.  The  analyzer  is  designed  for  stand-alone  operation  without 
any  peripheral  storage  devices,  so  it  includes  an  MS3404.04  programmable 
read-only  memory  (PROM)  module,  which  stores  the  entire  program  and  certain 
tests.  Even  if  an  M2  processing  is  used  in  the  analyzer,  the  box  of  the 
computer  still  has  room  for  a  half-height  board,  a  serial— communications 
interface  or  a  floppy  drive  controller.  The  capabilities  of  the  analyzer 
would  thereby  be  expanded  in  a  qualitative  way. 

1.1.  The  data  input  subsystem  uses  113PV1  lO-digit  integrated-circuit  ADC;  it 
has  two  channels  and  an  interface  which  provides  direct  access  to  memory  and 
which  is  based  on  a  KR580VT57  large-scale  integrated  circuit.  The  input  of 
each  of  the  two  channels  has  a  symmetric  differential  circuit  in  which  three 
KR544UD1A  operational  amplifiers  are  used.  Each  of  the  operational 
amplifiers  is  followed  by  a  KR1100SK2  sampling  and  storage  circuit.  The 
half-height  board  also  contains  a  programmable  quartz  Clock  based  on  a 
KR580VI53  large-scale  integrated  circuit,  which  sets  the  data  sampling 
period.  The  analog  section  is  powered  by  the  ±12-V  power  supply  of  the 
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computer  through  isolating  LC  filters.  There  is  no  galvanic  decoupling. 
Tests  have  shown  that  the  pickup  from  the  digital  devices  does  not  exceed 
of  a  digit  of  the  ADC  in  the  input  range  ±2.5  V.  The  data  acquisition  rate 
can  range  up  to  32-33  kHz,  through  two  channels  simultaneously,  so  a 
frequency  analysis  band  of  up  to  16  kHz  can  be  achieved. 

The  existing  version  of  the  data  input  system  does  not  have  a  controllable 
antialiasing  filter;  the  user  himself  is  expected  to  limit  the  band  of  the 
input  signal. 

1.2.  An  important  function  of  the  analyzer,  if  it  is  to  be  user-friendly,  is 
to  provide  a  high-quality  display  of  the  results  of  the  measurements  in 
graphics  form.  Unfortunately,  graphics  terminals  are  not  yet  widely 
available.  We  attempted  to  reconcile  the  standard  alphanumeric 
representation  of  a  15IE-00-013  display  unit  with  a  graphics  representation 
generated  by  a  special  video  buffer.  The  display  unit  itself  was  not 
revised.  The  circuits  for  sweeping  the  module  of  the  graphics  video  buffer 
are  synchronized  by  the  line  and  frame  pulses  coming  from  a  cable  connecting 
the  video  logic  of  the  display  unit  to  its  monitor.  The  data  generated  by 
the  video  buffer  are  summed  with  the  alphanumeric  video  signal  and  sent  to 
the  input  of  the  monitor,  with  the  result  that  a  sum  of  images  is  formed. 

The  microprogramming  unit  of  the  video-memory  access  ensures  a  nonflickering 
image,  regardless  of  software  accesses  of  the  video  buffer,  in  any  part  of 
the  raster  sweep  cycle.  The  basic  module  (on  a  half-height  board)  forms  an 
image  field  of  512  x  256  pixels  and  has  the  hardware  capability  of  shifting 
the  entire  image  at  steps  of  one  pixel  vertically  and  eight  horizontally. 
There  is  provision  for  increasing  the  number  of  memory  layers  by  connecting 
an  additional  board. 


2.  Control  Program 

As  can  be  seen  in  fig.  1,  the  analyzer  has  no  specialized  modules  for 
calculating  FFT.  All  of  the  data  processing  is  carried  out  by  software 
facilities.  The  program  initializes  three  pseudoprocessors,  which  are 
synchronized  in  a  special  way. 

The  first  processor  has  top  priority  and  controls  the  data  input  and  the 
transmission  data  to  the  processing  processor.  The  input  processor  controls 
the  direct— access  unit,  generating  addresses  at  the  appropriate  time  for  the 
binary  data  buffering.  It  starts  the  data  processing,  monitoring  the 
processing  to  see  whether  it  is  being  carried  out  in  real  time.  In  addition 
it  synchronizes  the  triggering  of  the  data  acquisition  by  an  input  signal: 
there  is  a  free  or  slave  triggering  regime  upon  a  positive  or  negative  front 

The  data  processing  processor  is  implemented  as  a  pseudomachine  which 
contains  a  memory  of  macroinstructions  of  a  special  format,  which  in  a  sense 
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Table  L 
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Elektronika  10(V2S 

0.G1 

Table  2.  Comparatiye  Characterisics  of  Certain  Common  Sig¬ 
nal  and  Spectrum  Analyzers 


Model 

Manufacturer 

Frequency 
band.  kHz 

Input: 
digits  (dB) 

Channels 

Windows 

I 

•  X 

OC  JC 

m 

s 

09 

4 

E 

o  E 
>  -o 

e 

I. 

SM-II 

USSR 

0—50 

lOp 

2 

7 

16 

204S 

2347 

550 

2515 

cDnicl  &  K  jacr» 

0-20 

(GO) 

1 

2 

? 

1275 

24 

16 

2031 

Thesame 

0—20 

12p 

1 

2 

2 

4(M) 

66 

22 

2033 

» 

0—20 

12p 

1 

2 

2 

4tXl0 

66 

22 

2034 

s 

0-25,6 

12p 

2 

4 

1.6 

801 

66 

35 

3582A 

ellcwIcU  —  Packards 

0-25,5 

(70) 

1 

3 

7 

25(1 

44 

24 

3561A 

Thesame 

0-100 

m 

1 

4 

7,5 

4(N) 

39 

22 

3562A 

» 

0-100 

(70) 

1 

Ci- 

10 

800 

54 

26 

CP-200 

«Ono  Sokkis 

0—20 

12p 

1 

•> 

0.2 

4IMI 

18 

8 

CF-gzo 

Thesame 

0—100 

12|i 

'» 

6 

2 

KtHI 

54 

28 

Analyzer  of  preserd  study 

0-16 

lOp 

1 

3r 

0,3 

4006 

116 

51 

constitute  a  problem-oriented  language  of  the  analyzer.  A  seq[uential 
interpretation  of  the  macroinstructions  by  an  execution  subprogram  arranges 
the  data  processing  in  the  necessary  order.  The  processor  has  a  regime  of 
editing  of  the  macroinstructions,  so  there  is  some  flexibility  in  the 
processing.  The  memory  is  automatically  reallocated  under  the  buffer  files 
with  the  appropriate  control.  In  a  single-channel  version,  the  following  set 
of  macroinstructions  was  realized:  multiplying  an  input -data  vector  by 
several  standard  windows,  taking  direct  and  inverse  Fourier  transforms, 
forming  spectral  amplitudes,  linear  and  exponential  averaging,  taking  the 
logarithm  of  the  spectrum,  scaling  the  spectrum  and  generating  a  graphics 
representation. 

The  control  processor  synthesizes  on  a  graphics-free  field  of  the  screen  a 
multilevel  menu;  it  processes  the  user's  instructions.  A  particular  line  on 
the  menu  is  chosen  by  pressing  the  key  with  the  appropriate  number  once;  the 
effect  is  to  maximize  the  speed  at  which  the  apparatus  is  controlled.  In  the 
specification  of  numerical  values,  extensive  use  is  made  of  a  cyclic 
selection  of  several  standard  values  by  means  of  control  keys  (the  arrow 
keys,  etc.).  In  cases  where  it  is  possible,  constants  are  specified 
directly,  with  the  necessary  protection  against  incorrect  values.  The  entire 
control  of  the  input  and  output  is  implemented  through  an  EMT  processor; 
diagnostics  of  the  hardware  and  error  messages  are  controlled  through  a  TRAP 
processor. 
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Most  of  the  data  processing  time  is  taken  up  by  the  FFT.  We  implemented  a 
standard  Cooley-Tukey  algorithm  in  integer  arithmetic.  The  speed  of  this 
algorithm  is  equal  to  the  speed  of  the  FFT  program  of  [ 1 ] •  There  is  the 
difference  that,  in  order  to  economize  on  memory,  the  sine  table  is  converted 
for  each  new  value  of  the  length  of  the  data  file  to  be  processed,  and  a 
complex  repacking  algorithm  is  used  to  accelerate  the  calculations  of  the  FFT 
of  a  real  vector.  The  operating  time  of  the  algorithm  (the  transformation  of 
1024  real  points)  on  various  software-compatible  computers  is  shown  in  table 

1.  Interestingly,  the  speed  of  a  single-board  microcomputer  such  as  the 
MS1201.02  is  only  10  percent  lower  than  that  of  an  Elektronika  100/25 
minicomputer,  and  it  is  20  percent  better  than  the  speed  of  the  Elektronika 
81  or  Elektronika  60.1  microcomputer. 

The  program  described  above  has  made  it  possible  to  implement  a  signal 
analyzer  which  is  controlled  as  an  ordinary  instrument.  After  the  power  is 
turned  on,  the  program  is  automatically  loaded,  and  a  10-item  menu  is 
displayed: 

0.  Set  parameters  (data  acquisition) 

1.  Program  1  (display  input  data) 

2.  Program  2  (power  spectrum) 

3.  Program  3  (initial  data  and  spectrum) 

4.  Control  processing 

5.  Control  cursor 

6.  Generate  hard  copy  (on  a  D-lOO  printer) 

7.  Control  display 

8.  Plan  memory 

9 .  Edit  program 

Each  instruction  has  its  own  submenu,  or  it  synthesizes  information  about  the 
current  functions  of  the  control  keys. 

Program  2,  typical  of  spectral  analysis,  includes  the  acquisition  of  a  data 
file  (usually,  512  readings),  a  multiplication  of  this  file  by  a  Hamming 
window,  the  calculation  of  FFT  of  the  resulting  file,  the  generation  of 
spectral  amplitudes,  the  conversion  of  these  amplitudes  to  logarithmic  scale, 
the  normalization  to  a  given  base  level,  and  transmission  to  a  graphics  field 
on  the  screen  with  a  given  dynamic  range  and  an  emphasis  of  up  to  10  dB.  The 
resulting  periodograms  can  be  averaged  in  a  linear  or  a  logarithmic  regime, 
and  the  type  of  window  can  be  changed  (rectangular,  triangular,  Henning, 
Hamming  or  user— defined) .  After  a  sufficient  buildup  of  the  spectrum,  one 
can  move  a  cursor  over  the  graphics  window  and  specify  in  digital  form  the 
relative  amplitude  of  the  spectrum  at  the  frequency  selected  by  the  cursor. 
After  instruction  6  is  selected,  the  printer  generates  a  copy  of  the  image 
along  with  data  specified  by  the  cursor  and  a  complete  set  of  values  of  the 
parameters  of  the  given  measurement.  The  program  takes  about  10  Kwords  of 
memory  and  allocates  the  remaining  18  Kwords  for  user  data.  It  thus  becomes 
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possible  to  generate  up  to  4096  lines  in  a  power  spectrum  (without  a  regime 
of  averaging,  and  with  a  rectangular  or  Henning  window) . 

Finally,  table  2  summarizes  the  characteristics  of  certain  common  types  df 
signal  analyzers;  the  bottom  row  shows  the  characteristics  of  the  new  device 
which  we  have  been  discussing  here.  Although  the  input  characteristics  and 
the  processing  rate  are  slightly  lower  than  those  of  the  foreign  samples,  we 
believe  that  the  open  nature  of  this  system  and  the  possibility  of  adding  to 
the  complexity  of  the  subsequent  processing  completely  offset  that 
disadvantage,  particularly  in  view  of  the  many  problems  to  be  solved  in  the 
low-frequency  and  inf ralbw-frequency  regions.  Furthermore,  the  input 
characteristics  of  this  apparatus  could  be  improved  significantly  by  using 
other  signal  input  modules.  The  signal  band  which  can  be  processed  in  real 
time  could  be  increased  to  25-30  kHz  by  using  fast  peripheral  processors  of 
the  type  described  in  [2] . 
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SOFTWARE 
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*ADIMEX  AGRO^  ADAPTIVE  IMITATIVE  EXPERT  SYSTEM  FOR  TECHNOLOGICAL  PROCESS 
CONTROL  IN  FARMING 
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[Article  by  A. A.  Shevchenko,  candidate  of  agricultural  sciences,  and 
V.G.  Lapa,  candidate  of  technical  sciences] 

[Text]  Large  systems  often  encompass  objects  of  control  which,  in 
principle,  cannot  be  studied  in  active  experiment  (action-reaction) 
modes.  Such  systems  and  objects  include  chemical  and  nuclear  reactors, 
large  enterprises  and  associations,  economic  and  ecological  regional  and 
global  systems,  and  objects  in  the  agroindustrial  complex. 

In  solving  problems  in  the  ecology,  economics,  or  management, 
forecasting  is  of  primary  importance.  The  consequences  of  any  decision 
or  purposeful  action  should  be  taken  into  account  as  accurately  as 
possible.  The  volume  and  completeness  of  information  about  an  object, 
its  measured  and  unmeasured  characteristics,  and  the  state  of  and 
changes  in  the  environment  stipulate  different  approaches  to  solving  a 
problem.  The  fundamental  concept  is  based  on  three  approaches — 
deterministic,  probability  and  adaptive. 

The  deterministic  approach  is  based  on  known  data  or  data  that  can  be 
determined  beforehand  with  the  requisite  accuracy. 

The  probability  approach  takes  into  account  the  influence  of  obstructing 
parameters  (noises,  interference).  The  a  priori  characteristics  are 
considered  probable  and  defined  according  to  known  implementations. 


102 


Both  approaches  are  based  on  a  large  volume  of  selected  observation 
data,  i.e.,  representative  statistics  are  required. 

The  adaptive  approach  is  unavoidable  when  solving  problems,  for  which 
the  initial  data  are  contained  in  brief  samplings.  Often,  not  only  are 
the  mathematical  descriptions  of  objects  and  of  the  environment  unknown 
beforehand,  but,  for  a  number  of  reasons,  it  also  seems  impossible  to 
determine  them.  The  quality  of  a  problem's  solution  should  improve  in 
proportion  to  the  accumulation  of  information.  The  system  should  be 
capable  of  self-organization  and  forecasting  [l]. 

Study  of  the  objects  during  the  process  of  their  natural  functioning 
(passive  experimentation)  is  characterized  by  a  substantial  growth  in 
the  volume  of  data  and  requires  the  application  of  fundamentally  new 
technologies  for  collecting,  analyzing,  sampling  and  processing  of 
information.  The  improvement  of  technical  equipment  (universal  and 
personal  computers)  makes  it  possible  to  organize  and  centralize 
information  streams  in  automated  information  retrieval  systems, 
automated  data  processing  systems,  and  automated  management  systems 
(ASU) ,  which  either  exist  or  are  being  developed.  At  the  present  time, 
the  main  trend  is  to  develop  universal  technologies  for  information 
processing,  subsequently  adapting  them  to  task-oriented  systems. 

However,  intensification  in  the  area  of  automatic  control  problems 
requires  the  creation  of  specialized  information  technologies,  taking 
into  account  the  features  of  specific  ASU. 

The  "Agroprognoz"  Automated  Management  System  (developed  by  the 
Ukrainian  branch  of  "AIUS-Agroresursy"  VNITs)  is  intended  for  on-line 
control  of  production  processes  in  farming.  Models  for  forecasting  the 
phenological  phases  of  plant  development,  which  considers  the  existing 
and  predicted  agrometeorological  situation,  forms  a  basis  for 
development  of  an  ASU.  Recommendations  are  being  drafted  for  the 
application  of  technological  methods,  aimed  at  achieving  maximum  crop 
yield.  The  "Agroprognoz"  ASU  includes  the  automated  information 
subsystems  (AIP):  "Pogoda"  [Weather];  "Fenologiya"  [Phenology];  and  the 
"Zashchita  Rasteniy"  [Plant  Protection]  and  "Tekhnologiya"  [Technology] 
information  management  subsystems  (lUP)  [2]. 

Input  information  for  the  system:  agrochemical,  agrometeorological, 
plant  growth  and  development  rates,  plant  biometric  indicators, 
quantitative  indicators  for  productivity  factors,  information  on  the 
most  common  pests  and  diseases,  their  characteristics  and  thresholds  of 
destructiveness,  characteristic  ways  of  protection,  information  on  the 
phytosanitary  condition  of  crops,  on  soils  in  natural  climatic  zones  and 
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provinces,  specific  features  of  regionalized  agricultural  crop 
varieties,  a  description  of  cultivation  methods,  and  ratings  of  the 
effectiveness  of  technological  methods  used  under  various  soil  and 
agroclimatic  conditions . 

Output  information:  characteristics  of  agrometeorological  conditions 
£qi«  tihe  periods  of  crop  growth  and  development,  assessment  of  weather 
conditions  for  current  vegetation,  a  prediction  of  basic 
agrometeorological  factors,  time  periods  for  start  of  phenophase  and 
biometric  indicators  on  the  prehistory  and  for  the  current  vegetation 
period,  an  evaluation  and  forecast  of  crop  conditions  for  a  requested 
date,  prediction  of  time  periods  for  appearance  of  diseases  and  spread 
of  pests,  information  on  recommended  means  of  protection, 
recoranendations  on  most  effective  agrotechnical  methods  to  use  in  a 
situation  that  has  specifically  occurred  or  is  predicted. 

The  enumerated  data  and  knowledge  structures  can  conditionally  be 
divided  into  three  basic  groups  [3]:  established  information  (supplied 
by  state  services),  special  and  normative  information  (biological  and 
genetic  features  of  crops,  the  corresponding  constants,  technical  and 
economic  indicators  and  reference  materials,  data  from  scientific 

and  scientific  production  organizations,  from  zone  varietal 
testing  stations,  and  from  special  services  organized  on  farms)  and 
information  on  the  material  and  technical  possibilities  of  farms,  their 
plan  and  economic  indicators,  the  distribution  structure  for  areas  being 
planted,  etc. 

In  the  process  of  developing  the  ^Agroprognoz  ASU,  new  mathematical 
(multi“f actor  regression  and  imitative)  models  of  plant  cultivation 
production  processes  are  developed,  and  verification  is  done  for 
existing  ones.  However,  if  sufficiently  representative  statistical 
material  has  accumulated  for  the  information  streams  and  AIP  models, 
data  on  the  conditions  and  results  of  utilization  of  means  of  protection 
and  agrotechnical  methods  for  the  lUP  are  contained,  as  a  rule,  in  short 
samples  from  the  results  of  observations  and  are  a  component  part  of  the 
knowledge  of  expert  specialists.  The  problem  of  creating  the  system  has 
made  it  necessary  to  develop  specifications  for  a  technology  for 
processing  the  results  of  meteorological,  phonological  and  biological 
observations  and  for  modeling  biological  objects.  In  this  regard,  the 
methods  and  approaches  being  developed  within  the  framework  of 
"artificial  intelligence”  were  used,  and  the  ADIMEX  AGRO  adaptive 
imitative  expert  management  system  was  created  as  a  result. 
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In  the  computerized  ADIMEX  system,  knowledge  is  represented  by  the  sum 
total  of  frames  (situations,  scenarios)  and  deduction  rules.  The 
original  set  of  frames  is  determined  by  experts  in  the  field  (in  this 
case,  by  agronomists,  soil  scientists,  agrometeorologists,  specialists 
in  plant  protection,  and  others). 

The  expert  "supplies”  knowledge  to  the  engineering  and  systems  unit 
(interpreter),  which  knows  the  structure  of  ADIMEX  and  is  capable  of 
representing  knowledge.  Knowledge  is  often  "fragments  of  prose," 
containing  rules  of  the  type:  "if...,  then...,"  "if...,  then  often...," 
"if...,  then  with  great  probability...,"  etc.  Fairly  complex  statements 
may  take  the  place  of  the  dots,  containing  both  numbers,  as  well  as 
linguistic  modifiers  of  the  type:  "big,"  "average,"  and  "rarely,"  as 
well  as  vague  concepts,  for  example,  "with  great  probability ,"  "surface 
plowing,"  and  "harvest  shortfall."  Knowledge  is  then  stored  in  the 
computer  in  special  logical  structures  and  processed  through  methods, 
known  as  "knowledge  engineering."  User  access  to  the  memory  of  a  large 
computer  at  his  work  place  is  provided  by  a  display,  connected  to  the 
computer  via  the  telephone  network  or  a  special  cable.  Considering  the 
rates  and  prospects  for  computer  hardware  development,  it  will  be 
possible  in  the  near  future  to  create  large  imitative  expert  systems 
based  on  personal  computers. 

From  time  to  time,  ADIMEX  is  "reviewed"  by  expert  verifiers.  These  may 
be  specialists,  invited  from  outside,  or  the  experts  themselves,  who 
formed  this  system,  can  also  perform  this  role.  From  the  initial  set  of 
frames,  a  knowledge  base  [KB]  is  formed,  which  is  used  for  comparison 
with  existing  and  predicted  situations  for  the  purpose  of  drafting 
recommendations  and  management  actions.  During  operation,  the  KB  is 
continuously  supplemented  and  corrected  as  a  result  of  receipt  of  new 
infonnation  on  the  outcomes  of  work  by  the  system  itself,  as  well  as 
information  from  literary  sources,  reports,  and  developments — this  is 
the  adaptive  systems  operation  mode  (continuous  updating).  ADIMEX 
should  not  only  make  decisions  in  a  timely  manner,  but  also  give  an 
acceptable  explanation  as  to  why  precisely  this,  and  not  some  another 
method  should  be  chosen,  and  it  should  rate  (for  instance,  in  points) 
solution  variants  for  different  situations  from  the  viewpoint  of 
economics,  ecology,  production  quality,  implementation  complexity,  etc. 

In  order  to  represent  knowledge,  structures  are  used,  based  on  the 
calculation  of  mathematical  logic  statements  and  predicates,  for 
example,  representations  of  the  implication  type,  IF — THEN:  condition — 
action.  Part  of  the  IF  rule  is  the  formulation  of  a  number  of 
conditions  (prerequisites),  under  which  the  rule  is  applicable: 


1Q5 


IF  1)  . . .  and. . .  or. .  . ; 

2 )  . . .  or .... 

3) ...; 

THEN  1)...  and...; 

2)  ,  but  not . . . ; 

3^ ...  or ... 

The  effect  of  the  "or"  part  of  a  THEN  rule  is  a  conclusion,  which  must 
be  made  if  the  condition  is  met. 

The  ALSIR  algorithmic  language  for  describing  situations  and  solutions 
( reconmendations )  is  structured  on  the  laws  for  calculating  statements 
and  predicates.  For  example,  the  block  "situation-solution"  can  be 
presented  in  the  following  form;  (  zone  ZjA  soil  PjiA>  precursor^  ^ 
...)  — ^  ('technological  methods'  T^,  'aggregate  A^A^A  fertilizer 
F5  .  . .  ) ,  where  A  is  the  sign  AND — conjunction;  V  is  the  OR  sign 
disjunction;  and  ^  is  the  IF-THEN  sign~~implication.  Logical 
relations  tables  described  in  an  algorithmic  language  are  programmed  and 
used  in  order  to  form  knowledge  bases . 

The  formation  and  management  of  knowledge  structures  on  a  computer  is 
supported  by  a  knowledge  base  management  system,  based  on  the  relational 
model  of  knowledge  representation.  Relational  knowledge  bases  are 
sufficiently  user-friendly  for  those  who  have  little  experience  with 
computers . 

Raising  the  immediacy  of  control ,  multi— variant  solutions  with 
ecological  and  economic  assessments,  optimization  of  agrotechnical 
methods,  and  rational  use  of  material  and  labor  resources  are  the  basic 
factors  which  ensure  the  economic  expediency  of  developing  and 
introducing  an  adaptive  imitative  expert  system. 

Practical  use  of  the  "Agroprognoz"  ASU  makes  it  possible  to  increase  the 
production  output  per  hectare  of  plowed  field  by  0.2-0. 3  centners  in  a 
grain  equivalent,  or  by  2.5  rubles,  which,  taken  for  the  UKSSR  on  the 
whole,  amounts  to  over  75  billion  rubles.  Taking  into  account 
development  outlays,  the  expected  economic  effect  from  using  the  system 
in  the  republic  is  about  40  million  rubles. 
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[Text]  Digital-filtering  methods  are  very  popular  in  the  processing  of 
experimental  data.  One  reason  is  the  widespread  dissemination  of  fast 
parallel  processing  computers.  A  second  is  the  appearance  of  new  and  more 
efficient  algorithms  for  implementing  digital  filters,  which  are,  in  turn, 
easily  adaptable  to  parallel  processing. 


In  this  paper  we  describe  some  algorithms  for  the  digital  filtering  of 
signals  and  images  which  have  been  realized  on  the  PS-2000  multiprocessor 
computation  complex  [MVK] .  We  report  some  calculations  which- demonstrate  the 
efficiency  of  the  processing  of  experimental  data. 


Brief  Description  of  the  PS-2000 

In  order  to  exploit  the  advantages  of  a  microprocessor  system  it  is  necessary 
to  consider  the  functional  capabilities  of  its  hardware  facilities.  Let  us 
examine  some  features  of  the  PS-2000  [1] . 

The  PS-2000  complex  falls  in  the  category  of  single-instruction-stream, 
multiple-data-stream  [SIMD]  computers.  It  includes  an  SM-2  computer,  which 
controls  the  system,  and  a  PS-2000  matrix  processor,  which  performs  the  basic 
data  processing.  The  matrix  processor  can  contain  up  to  64  uniform  processor 
elements  [PE] ,  which  act  upon  instructions  from  the  control  device  [UU] . 

Each  processor  element  has  its  own  working  memory  [OP]  of  26K  24-bit  words, 
and  an  arithmetic-and-logic  unit  [ALU],,  which  contains  16  general-purpose 
registers  that  perform  rapid  data  processing.  The  exchange  of  data  between 
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Table  1. 
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Figure  1. 


neighboring  processor  elements  is  transacted  along  regular  channels  [RK] , 
while  the  exchange  of  data  between  a  processor  elemient  and  the  control  ' 
element  is  transacted  along  buses  [MK] .  Figure  1  is  a  simplified  schematic 
diagram  of  the  PS-2000  complex.  The  number  p  of  processor  elements  operating 
in  the  system  can  be  a  multiple  of  eight;  i.e./  p  =  8,  16,  64.  All  of 

these  elements  synchronously  perform  computational  operations  on  data  stored 
in  the  general-purpose  registers  or  in  the  memory  layers  of  the  processor 
elements-  Each  register  and  memory  layer  consists  of  a  set  of  p  24-bit 
words.  Operations  with  vectors  of  length  p  are  carried  out  on  a  component 
basis  and  simultaneously:  ,  ^ 

(c^, . . Cp)  —  (a^, . . ap)  *  (b|, . . bp),  q  =  aj  *  bj,  i  =  1, . . ,,  p. 
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where  the  asterisk  (*)  is  some  arithmetic  or  logic  operation.  Various  shifts 
of  the  binary  digits  of  the  words  can  be  used  to  accelerate  multiplicatio 
operations.  In  the  exchange  of  data  between  processor  elements  the 
of  cyclic  permutation  is  used;  the  components  of  a  vector  are  moved  rig 
left.  For  example,  a  cyclic  permutation  of  the  vector  (a^,  ap)  three 
words  to  the  right  generates  a  vector  (ap.^,  ap.i,  ap,  a^,  a^,  ...,  ap-3)  .  n  e 

execution  of  certain  instructions,  only  some  of  the  processor  elements  may  be 
activated.  The  others  will  be  in  a  wait  state.  In  addition 
working  memory  of  the  processor  elements  (M)  there  is  a  memory 
used  for  scalar  instructions  and  constants. 


Table  1 


shows  some  characteristics  of  the  speed  of  the  system  in  the 

_ 3  I  -P  A 


execution  of  basic  arithmetic  instructions.  Here,  +,  and  +p  are  fixed-point 
and  floating-point  summation  instructions,  while  and  *p  are  corresponding 
multiplication  instructions. 

In  the  use  of  the  PS-2000  complex,  it  is  desirable  to  adhere  to  the  following 
recommendations:  1)  the  computation  algorithm  should  be  in  vector  form,  an 

the  length  of  the  vectors  should  be  proportional  to  the  number  of  processor 
elements;  2)  data  which  are  used  frequently  should  be  stored  in  registers;  3) 
computational  instructions,  memory  accessing  and  data  input  and  output  should 
be  alternated,  since  these  actions  can  be  carried  out  in  parallel;  and  4) 
division  instructions  should  be  eliminated  from  the  algorithm  to  the  greatest 

possible  extent. 


Real-Time  Signal  Processing 


In  the  processing  of  signals  (acoustic,  radar,  etc.),  digital  filters  often 
must  work  in  real  time.  Let  us  assume  that  signal  readings  f(n)  are  arriving 
continuously  at  a  certain  frequency  «>,  ...,  f(n  -  1),  f  (n) ,  f  (n  +  1)/  ••••  e 
task  is  to  apply  to  f  (n)  the  filter 


g(n)  =  Eka)f(n-/),  n  =  0,l,..v 

/=o 

with  a  given  pulsed  characteristic  k (n)  and  to  generate  a  sequence  of 
readings  g(n)  at  the  same  frequency. 

Let  us  examine  some  algorithms  for  the  real-time  processing  of  a  signal  f(n) 
by  filter  (1)  which  are  implemented  on  the  PS-2Q00. 

We  can  write  expression  (1)  in  an  equivalent  form. 

g(n)  =  Sk,(/)f(n,  +  0,  n  =  0,l,..., 

1=0 


XIQ 


where  ki(i)  =  k(N  -  i);  1  =  0,  N;  =  n  -  N. 

Let  us  assume  that  the  number  of  coefficients  in  pulsed  characteristic  k^{l) 
is  a  multiple  of  the  number  p,  N  +  1  =  pc,  where  1  <  c  <  7.  We  denote  by 
(i  =  1,  16)  the  general-purpose  registers,  while  {i  =  1,  16K)  are 

the  layers  of  the  working  memory  of  the  processor  element-  We  will  use  the 
following  instructions: 

1)  summation,  +  Rj,  and  multiplication,  R^  *  Rj,- 

2)  writing  of  the  value  of  R^  in  R.,  R^  ->  R^; 

3)  shifting  of  words  in  a  register  (cyclic  permutation) ; 

4)  writing  a  layer  of  memory  in  a  register,  Rj,- 

5)  sending  one  word  from  R^  to  R^; 

6)  input  and  output  of  one  word  from  a  register. 

The  data  processing  can  be  carried  out  in  either  the  integer-number  format 
(for  this  purpose,  the  real  coefficients  ki(n)  are  written  in  the  form 
m(n)/2”,  where  M  and  m(n)  are  integers,  with  n  =  0,  ...,  N)  or  the  real-number 
format-  The  integers  are  subsequently  converted  into  real  numbers - 

Let  us  assume  that  each  reading  is  processed  by  all  of  the  processor 
elements-  We  store  the  coefficients  ki(0),  ...,  k^  (N)  and  the  readings 
f(-N),  ...,  f  (0)  in  registers  R^,  ...,  R^  and  R^+i,  ...,  R2^,  respectively.  To 
calculate  the  value  of  g(0),  g(l),  ...  we  use 

Algorithm  1: 
l.n  =  0. 

2. 0  Rig,  i  =  1,  j  =  c  +  1,  /  =  1,  q  =  1, 

3.  Ri  (coefficients)  *  Rj  (readings)  +  R|6 

4.  If  1  <  c,  then  word  1  from  Rj^^,  else  f(n  + 1) word  1  of  Rj. 

5.  Shift  of  R|  one  word  to  the  left. 

6.  i  =  i  +  l,j  =  j  +  lifl<cin  step  3. 

7.  Rjg  +  (result  of  a  leftward  shift  of  I  words  in  R^^)  R^^. 

8.  If  q  <  log2p,  then  /  =  2/,  q  =  q  +  lm  step  7;  otherwise,  send  g(n)  from  word  1  in  R^^ 

9.  n  =  n  + 1  in  step  2. 

If  a  filter  with  a  large  number  of  coefficients  is  to  be  realized,  it  becomes 
necessary  to  use  the  memory  of  the  processor  elements.  There  will 
accordingly  be  large-scale  exchanges  between  the  memory  and  the  registers  and 
thus  an  increase  in  processing  time. 

The  second  algorithm  is  essentially  one  in  which  each  processor  element 
processes  its  own  reading.  We  introduce  groups  of  readings 

G(m)  =  {g(n), . . g(n  +  p  - 1)}  and  F(m)  =  {f(n - N), .. .,  f(n  - N  +  p  - 1)}, 

n  =  mp,  m  =  0, 1, ... 
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TO  calculate  the  values  of  group  G(m)  with  the  help  of  F  (m) ,  F(m  +  c)  we 
use 

AlgorithmZ: 

1.0-+Rv  /  =  0,  F(m)->]^.  . 

2. ki(/)-4R2. 

3.  R]  +  Rj  *  R3  — >  Rr 

4.  If /  =  N,  then  branch  to  step  7. 

5.  /  =  /  + 1;  in  word  1  of  register  R3  we  store  f(n  -  N  +  p  + /)• 

6.  Shift  R3  one  word  to  the  left  and  branch  to  step  2. 

7.  Extract  group  G(m)  from  R,. 

The  processing  of  G  (m  +  1)  requires  the  repeated  use  of  F  (m) ,  F(m  +  0)  and 
of  the  new  group  (Fm  +  c  +  1) .  The  storage  of  the  results  should  be 
organized  in  the  available  registers.  The  coefficients  k^d)  are  stored  in 
the  first  N  +  1  layers  of  S,,,  ■  (i  =  0,  N)  of  the  working  memory.  Each 
layer  of  contains  p  coefficients  k^{l)  . 

We  will  assume  below  that  the  signal. arrives  at  a  frequency  o)  =  48  kHz. 

This  figure  corresponds  to  the  accepted  standard  in  the  development  of 
devices  for  the  digital  processing  of  studio-link  acoustic  signals.  The 
algorithms  which  have  been  proposed  have  been  implemented  in  the  language  HPS 
for  the  PS-2000  processor  [1] .  Numerical  experiments  on  real-time  signal 
processing  were  carried  out  in  a  system  containing  16  processor  elements.  It 
was  established  as  a  result  that  algorithm  1  makes  it  possible  to  realize  a 
filter  which  operates  in  realtime  with  no  more  than  16  coefficients.  For 
algorithm  2,  the  calculations  were  carried  out  with  both  integer  and  real 
numbers.  The  results  are  summarized  in  table  2.  We  see  from  this  table  that 
a  filter  with  48  coefficients  can  cope  with  the  processing  of  signals 
arriving  at  a  rate  of  48  kHz  on  the  PS-2000  with  16  processor  elements. 


Implementation  of  Image-Processing  Algorithms 


A  nonrecursive  filter  which  acts  on  an  image  ftni,n2)  is  defined  by  a 
convolution  transformation: 

CVJ  ■ 

7I2)  =  /.:*/ (n^,  n.^)  =  S  h)fi’h  +  ^i>  +  h)y 

where  k(ni,n2)  is  the  pulsed  characteristic  of  the  filter.  We  denote  by 
K(§i,§3)  the  frequency  characteristic: 


(2) 


The  usual  task  is  to  process  an  image  f(ni,n,)  by  a  digital  filter  with  some 
pulsed  characteristic  k{ni,n2)  which  corresponds  to  a  given  function  K(4i,^2)' 
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Table  2< 


Number  of 
readings  lo  be 
processed 

Number  of 
coefficients 

Required 
time,  sec 

1  Expended  time,  sec 

Integer 

format 

Reel  format 

708  000 

32 

16 

12,35 

12^50 

1  024000 

32 

21 

10«36 

10,07 

708  000 

48 

10 

15,40 

1  152  000 

48 

24 

23,10 

1  530  000 

48 

32 

30,80 

1  ,  ■  ■  ’ 

An  effective  operation  of  a  combinational  assembly  is  frequently  used  in 
calculations  on  a  wide  class  of  two-dimensional  digital  filters  (bandpass 
filters,  rejection  filters,  inverse  filters,  etc.)  .  In  this  case  the 

pulsed  characteristic  is  written  in  the  form  , 

N  ^ 

”2)  =  S  V'’?  (”ii  “a)- 

Here,  bj  are  the  transformation  coefficients;  kj^(ni/n2)  is  a  convolution  of  a 
function  ko(ni,n2)  with  itself  j  times  (this  function  has  a  small  number  of 
coefficients;  for  example,  it  could  be  lumped  at  3  x  3  points  of  a  grid).  A 
decomposition  of  this  sort  of  the  function  k^  (n^/nj)  trtakes  it  possible  to 
significantly  reduce  the  number  of  computational  operations  during  the 
operation  of  a  digital  filter. 


Let  us  describe  some  algorithms  for  processing  images  with  digital  filters 
with  pulsed  characteristics  k(n;t/^2)  k^{n^,n2)  which  have  been  implemented 

on  the  PS-2000  multiprocesssor  computation  complex*  We  assume  that  the 
values  of  the  image  f  (ni,n2)  (n^  -  1,  Ml;  n2  1/  M2)  are  put  in  a 

matrix  A(M1,M2),  the  numbers  of  rows  and  columns  of  which  are  multiples  of 
the  number  of  processor  elements  p:  Ml  ^  Klp/  M2  “  K2pi  We  assume  that  the 
pulsed  characteristic  k(ni,n2)  spans  N1  X  N2  nodes  of  the  mesh:  n^  =  -n^,  ..., 

Hi,  n2  =  -n2,  ...,  Hg,  N1  «  2ni  +  1,  N2  -  2n2  +  1.  We  ate  to  calculate 

2  2  +  +  (4) 

at  the  points  Dj  =  1  +  n^,  Ml  -  n^;  nj  =  1  +  nj,  ...  M2  - 


g  («!,  Wa)  = 


We  write  the  matrix  A  in  the  memory  of  processor  elements  M.  In  the  memory 
of  the  first  processor  we  store  in  succession  the  rows  with  numbers  1,  K1 
+  N1  -  1;  in  that  of  the  second  processor  we  Store  the  rows  beginning  from 
K1  +  1  through  2K1  +  N1  -  1;  etc.  Processor  J  will  contain  rows  (J  -  1)K1  + 
1,  ...,  J,  K1  +  N1  -  1  (J  =  1,  ...,  p)  . 


The  coefficients  of  the  pulsed  characteristic  k(ni,n2)  are  written  in 
succession  in  scalar  memory  H  in  the  form  of  a  One-dimensional  vector  B  of 
length  N1  x  N2 : 
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B(i)  =  k(n],n2),  n,  =  -ni,  n2  =  -n2,  ...,n2, 

i  =  (n,  +  ni)N2  +  n2  +  n2  + 1. 

multiply  the  first  N2  layers  of  the  memory  by  the  elements  1,  N2  of 
vector  B,  and  we  sum  them  in  register  Bl.  Layers  with  numbers  M2  +  1,  M2 
+  N2  are  multiplied  by  the  next  coefficients,  B{B2  +  1),  B(2-N2),  and  they 

are  also  summed  with  Rl.  We  carry  out  corresponding  operations  with  layers 
2KM  +  1,  2KM  +  M2,  K  =  2,  ...,  Ml  -  1 .  As  a  result,  register  Rl  obtains  the 
values  of  the  image  g(ni,n2)  with  indices  n^  =  (J  -  1)K1  +  n^  +  1,  J  =  1,  ..., 
p,  n2  -  +  1. 

In  the  memory  of  processor  elements  M  we  carry  out  a  shift  one  layer 
downward,  and  we  repeat  the  operations  described  above.  In  this  manner  we 
calculate  the  elements  g(ni,n2),  with  indices  nj  =  (J  -  1)K1  +  n^  +  1,  J  =  1, 

M.,  p,  n2  ““  n2  t  2 . 

Repeating  this  sequence  of  actions  in  a  cycle  Ml  x  M2/p  times,  we  obtain  the 
other  values  of  the  image  g(ni,n2)  (see  (4)),  and  we  complete  the  processing. 
This  algorithm  obviously  allows  full  exploitation  of  the  processor  elements. 

On  the  basis  of  a  convolution  operation,  we  achieve  a  parallel-processing 
implementation  of  a  filter  with  a  pulsed  characteristic  ki(ni,n2),  calculated 
by  the  operation  of  a  combinational  assembly  (3) .  For  this  purpose,  it  is 
necessary  to  carry  out  a  convolution  of  the  image  f(nj^,n2)  with  the  function 
kgCn^fnj)  repeatedly  in  a  separate  part  of  working  memory  M.  The  first  Ml  x 
M2/p  layers  of  memory  are  used  to  store  the  result. 

Formally,  the  working  algorithm  of  filter  (4)  can  be  written  as  follows: 

l.n  =  l. 

2.0^R1,  1  =  0,  k  =  n-l,  i  =  l. 

3. f=l. 

4.  Rl  +  S(k  +  p  *  B(/ +  j)  ^  Rl. 

5. j  =  j  +  l,  if  j^N2mstep4. 

6.i  =  i  +  l,  if  i>Nlinstep8. 

7. /  =  /  +  N2,  k  =  k  +  M2mstep3. 

8. Rl->S(n),  n  =  n  +  l  instepZ 

{jxample.  Figure  2  shows  an  image  g(nj^,n2)  containing  275  x  512  readings,  which 
has  been  distorted  by  a  convolution  operation  as  in  (2)  with  an  instrumental 
function 


k(n|,n2)  —  kj  *  kj(n|,n2). 


where  ki(ni,n2)  was  obtained  through  a  McClellan  transformation  [2,  4]  of  the 
one-dimensional  step  function 

fl/5,  n  =  -2,  2; 

K  (”)  “■  |o  linage. 
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Figure  2. 
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The  frequency  characteristic  which  corresponds  to  k;L(ni,n2),  is  of 

varying  sign,  so  the  problem  of  reconstructing  the  image  f(ni,n2)  is  an 
ill-posed  problem.  As  a  first  step,  we  amplify  the  high  frequencies  by  using 
the  filter  g(ni,n2)  =  q  *  g(n^,n2)  with  the  pulsed  characteristic 


Figure  3. 
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To  calculate  the  pulsed  characteristic  k(ni,n2)  of  the  inverse  filter  we  use 
the  combinational  assembly  operation  [3]: 

11.)  =  S  n.))*’, 

i*o 


where  =  4,  =  -6,  b2  =  4,  b3  *-  -1- 

Figure  3  shows  the  reconstructed  image  f(ni,n2)  =  k  *  g(ni,n2).  The  time 
required  to  process  this  image  on  the  PS-2000  with  16  processor  elements  was 
5s.  By  way  of  comparison,  a  corresponding  processing  on  an  ES-1061  computer 

took  39.5  s. 
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